Linux Inside
Table of Contents
1. Introduction
2. Booting
    i. From bootloader to kernel
    ii. First steps in the kernel setup code
    iii. Video mode initialization and transition to protected mode
    iv. Transition to 64-bit mode
    v. Kernel decompression
3. Initialization
    i. First steps in the kernel
    ii. Early interrupts handler
    iii. Last preparations before the kernel entry point
    iv. Kernel entry point
    v. Continue architecture-specific boot-time initializations
    vi. Architecture-specific initializations, again...
    vii. End of the architecture-specific initializations, almost...
    viii. Scheduler initialization
    ix. RCU initialization
    x. End of initialization
4. Interrupts
    i. Introduction
    ii. Start to dive into interrupts
    iii. Interrupt handlers
    iv. Initialization of non-early interrupt gates
    v. Implementation of some exception handlers
    vi. Handling Non-Maskable interrupts
    vii. Dive into external hardware interrupts
    viii. Initialization of external hardware interrupts structures
    ix. Softirq, Tasklets and Workqueues
    x. Last part
5. System calls
    i. Introduction to system calls
    ii. How the Linux kernel handles a system call
    iii. vsyscall and vDSO
    iv. How the Linux kernel runs a program
6. Timers and time management
    i. Introduction
    ii. Clocksource framework
7. Memory management
    i. Memblock
    ii. Fixmaps and ioremap
8. SMP
9. Concepts
    i. Per-CPU variables
    ii. Cpumasks
10. Data Structures in the Linux Kernel
    i. Doubly linked list
    ii. Radix tree
11. Theory
    i. Paging
    ii. Elf64
    iii. CPUID
    iv. MSR
12. Initial ram disk
    i. initrd
13. Misc
    i. How the kernel is compiled
    ii. Linkers
    iii. Linux kernel development
    iv. Write and Submit your first Linux kernel Patch
    v. Data types in the kernel
14. Useful links
15. Contributors
linux-insides
A series of posts about the Linux kernel and its insides.
The goal is simple: to share my modest knowledge about the internals of the Linux kernel and to help people who are interested in Linux kernel internals and other low-level subject matter.
Questions/Suggestions: Feel free to send any questions or suggestions by pinging me on twitter @0xAX, adding an issue or just dropping me an email.
Support
If you like linux-insides you can support me with:
On other languages
* Chinese
* Spanish
LICENSE
Licensed BY-NC-SA Creative Commons.
Contributions
Feel free to create issues or pull requests if you have any problems.
Please read CONTRIBUTING.md before pushing any changes.
Kernel boot process
This chapter describes the Linux kernel boot process. Here you will find a couple of posts which describe the full cycle of the kernel loading process:
* From the bootloader to kernel - describes all stages from turning on the computer to running the first instruction of the kernel;
* First steps in the kernel setup code - describes the first steps in the kernel setup code. You will see heap initialization and the querying of different parameters like EDD, IST, etc.
* Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and the transition to protected mode.
* Transition to 64-bit mode - describes the preparation for the transition into 64-bit mode and the details of the transition.
* Kernel Decompression - describes the preparation before kernel decompression and the details of direct decompression.
Kernel booting process. Part 1.
From the bootloader to kernel
If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level and many, many other things. So, I decided to write yet another series of posts about the Linux kernel for x86_64.
Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and if you find something wrong with my English or the post content, feel free to send a pull request.
Note that this isn't the official documentation, just learning and sharing knowledge.
Required knowledge
* Understanding C code
* Understanding assembly code (AT&T syntax)
Anyway, if you are just starting to learn some tools, I will try to explain some parts during this and the following posts. Ok, the simple introduction is finished and now we can start to dive into the kernel and low-level stuff.
All code is for kernel 3.18. If there are changes, I will update the posts accordingly.
The Magic Power Button, What happens next?
Despite that this is a series of posts about the Linux kernel, we will not start from the kernel code (at least not in this paragraph). Ok, you press the magic power button on your laptop or desktop computer and it starts to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.
80386 and later CPUs define the following predefined data in CPU registers after the computer resets:
IP          0xfff0
CS selector 0xf000
CS base     0xffff0000
The processor starts working in real mode. Let's back up a little to try and understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it can work with a 2^20 byte (1 megabyte) address space. But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory above 64 KB with 16-bit registers, an alternate method was devised. An address consists of two parts: the beginning address of the segment and an offset from this address. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:
PhysicalAddress = Segment * 16 + Offset
For example, if CS:IP is 0x2000:0x0010, the corresponding physical address will be:
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
But if we take the largest segment part and offset, 0xffff:0xffff, it will be:
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
which is 65519 bytes over the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef with A20 disabled.
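The wrap-around with A20 disabled can be tried out with a few lines of Python. This is only a sketch: the function name real_mode_address and the a20_enabled parameter are made up for illustration, and the A20 gate is modelled simply as masking the result to 20 bits.

```python
def real_mode_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Translate a real-mode segment:offset pair into a physical address."""
    # Both parts are 16-bit values; the segment is shifted left by 4 (multiplied by 16).
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        # Assumption: with A20 disabled, address bit 20 is dropped,
        # so everything above 1 MB wraps around to low memory.
        linear &= 0xFFFFF
    return linear


print(hex(real_mode_address(0x2000, 0x0010)))                    # 0x20010
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                    # 0xffef
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
```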
Ok, now we know about real mode and memory addressing. Let's get back to the register values after reset:
The CS register consists of two parts: the visible segment selector and the hidden base address. We know the predefined CS base and IP value, so the logical address will be:
0xffff0000:0xfff0
The starting address is formed by adding the base address to the value in the EIP register:
>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'
We get 0xfffffff0, which is 4 GB - 16 bytes. This point is called the Reset vector. This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump instruction which usually points to the BIOS entry point. For example, if we look in the coreboot source code, we see:
    .section ".reset"
    .code16
.globl reset_vector
reset_vector:
    .byte 0xe9
    .int _start - ( . + 2 )
    ...
Here we can see the jmp instruction opcode, 0xe9, and its destination address, _start - ( . + 2 ). We can also see that the reset section is 16 bytes and starts at 0xfffffff0:
SECTIONS {
    _ROMTOP = 0xfffffff0;
    . = _ROMTOP;
    .reset . : {
        *(.reset)
        . = 15;
        BYTE(0x00);
    }
}
Now the BIOS starts: after initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (which is 512 bytes). The final two bytes of the first sector are 0x55 and 0xaa, which signals to the BIOS that this device is bootable. For example:
;
; Note: this example is written in Intel Assembly syntax
;
[BITS 16]
[ORG 0x7c00]
boot:
    mov al, '!'
    mov ah, 0x0e
    mov bh, 0x00
    mov bl, 0x07
    int 0x10
    jmp $
times 510-($-$$) db 0
db 0x55
db 0xaa
Build and run it with:
nasm -f bin boot.nasm && qemu-system-x86_64 boot
This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
You will see:
In this example we can see that the code will be executed in 16-bit real mode and will start at 0x7c00 in memory. After starting, it calls the 0x10 interrupt, which just prints the ! symbol. It fills the rest of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
You can see a binary dump of this with the objdump util:
nasm -f bin boot.nasm
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
A real-world boot sector has code to continue the boot process and the partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, the BIOS hands over control to the bootloader.
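The layout the BIOS looks for (512 bytes, with 0x55 and 0xaa in the last two) is easy to reproduce outside of NASM. The following Python sketch builds such a sector and checks the signature. The helper names make_boot_sector and has_boot_signature are made up for this example, and the two code bytes 0xeb 0xfe are just the encoding of a short jump to itself (the jmp $ from the example above).

```python
SECTOR_SIZE = 512
SIGNATURE = b"\x55\xaa"  # the two magic bytes, in the order they appear on disk


def make_boot_sector(code: bytes) -> bytes:
    """Pad the code with zeros up to byte 510 and append the magic bytes."""
    if len(code) > SECTOR_SIZE - len(SIGNATURE):
        raise ValueError("boot code does not fit into the first 510 bytes")
    return code.ljust(SECTOR_SIZE - len(SIGNATURE), b"\x00") + SIGNATURE


def has_boot_signature(sector: bytes) -> bool:
    """Check what the BIOS checks: bytes 510 and 511 are 0x55 and 0xaa."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == SIGNATURE


sector = make_boot_sector(b"\xeb\xfe")  # jmp $ : an endless loop
print(len(sector), has_boot_signature(sector))
```

The same has_boot_signature check could be pointed at the boot file produced by nasm by reading it in binary mode.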
NOTE: As you can read above, the CPU is in real mode. In real mode, calculating the physical address in memory is done as follows:
PhysicalAddress = Segment * 16 + Offset
This is the same as mentioned before. We have only 16-bit general purpose registers, and the maximum value of a 16-bit register is 0xffff, so if we take the largest values, the result will be:
>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'
Where 0x10ffef is equal to 1 MB + 64 KB - 16 bytes. But the 8086 processor, which is the first processor with real mode, has a 20-bit address bus, and 2^20 = 1048576 is 1 MB. This means the actual memory available is 1 MB.
The general real mode memory map is:
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7FFF - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
In the beginning of this post I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much larger than 0xFFFFF (1 MB). How can the CPU access this in real mode? This is in the coreboot documentation:
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
At the start of execution, the BIOS is not in RAM, but in ROM.
Bootloader
There are a number of bootloaders that can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.
Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes grub_main.
grub_main initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the last preparation and shows a menu to select an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, which executes the grub boot command, booting the selected operating system.
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. The kernel header arch/x86/boot/header.S starts from:
    .globl hdr
hdr:
    setup_sects: .byte 0
    root_flags:  .word ROOT_RDONLY
    syssize:     .long 0
    ram_size:    .word 0
    vid_mode:    .word SVGA_MODE
    root_dev:    .word 0
    boot_flag:   .word 0xAA55
The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not see a description and explanation of all fields of the kernel setup header here; we will get back to that when the kernel uses them. You can find a description of all fields in the boot protocol.
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
         | Protected-mode kernel  |
100000   +------------------------+
         | I/O memory hole        |
0A0000   +------------------------+
         | Reserved for BIOS      | Leave as much as possible unused
         ~                        ~
         | Command line           | (Can also be below the X+10000 mark)
X+10000  +------------------------+
         | Stack/heap             | For use by the kernel real-mode code.
X+08000  +------------------------+
         | Kernel setup           | The kernel real-mode code.
         | Kernel boot sector     | The kernel legacy boot sector.
X        +------------------------+
         | Boot loader            |
So when the bootloader transfers control to the kernel, it starts at:
0x1000 + X + sizeof(KernelBootSector) + 1
where X is the address at which the kernel boot sector is loaded. In my case X is 0x10000, as we can see in a memory dump:
The bootloader has now loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.
Start of Kernel Setup
Finally we are in the kernel. Technically the kernel hasn't run yet: we first need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It is a little strange at first sight, as there are several instructions before it.
A long time ago the Linux kernel had its own bootloader, but now if you run for example:
qemu-system-x86_64 vmlinuz-3.18-generic
You will see:
Actually header.S starts from MZ (see image above), error message printing and the following PE header:
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
...
...
...
pe_header:
    .ascii "PE"
    .word 0
It needs this to load an operating system with UEFI. We won't see how this works right now; we'll see this in one of the next chapters.
So the actual kernel setup entry point is:
// header.S line 292
.globl _start
_start:
The bootloader (grub2 and others) knows about this point (0x200 offset from MZ) and makes a jump directly to it, despite the fact that header.S starts from the .bstext section, which prints an error message:
//
// arch/x86/boot/setup.ld
//
. = 0;                    // current position
.bstext : { *(.bstext) }  // put .bstext section to position 0
.bsdata : { *(.bsdata) }
So the kernel setup entry point is:
    .globl _start
_start:
    .byte 0xeb
    .byte start_of_setup-1f
1:
    //
    // rest of the header
    //
Here we can see a jmp instruction opcode, 0xeb, with the start_of_setup-1f displacement. In Nf notation, 2f for example refers to the next local label 2:. In our case it is label 1, which goes right after the jump and contains the rest of the setup header. Right after the setup header we see the .entrytext section, which starts at the start_of_setup label.
Actually this is the first code that runs (aside from the previous jump instruction, of course). After the kernel setup gets control from the bootloader, the first jmp instruction is located at offset 0x200 (after the first 512 bytes) from the start of the kernel real-mode code. We can read this in the Linux kernel boot protocol and also see it in the grub2 source code:
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.cs = segment + 0x20;
It means that the segment registers will have the following values after kernel setup starts:
gs = fs = es = ds = ss = 0x1000
cs = 0x1020
in my case, when the kernel is loaded at 0x10000.
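It is worth checking that cs = 0x1020 really points at the 0x200 offset mentioned above. A quick sketch, using the segment value from my case (the name physical is just a helper made up for the example):

```python
def physical(segment: int, offset: int) -> int:
    """Real-mode address calculation: segment * 16 + offset."""
    return (segment << 4) + offset


segment = 0x1000       # where the kernel setup code was loaded in this example
cs = segment + 0x20    # what grub2 puts into cs

# 0x200 bytes past the start of the setup code, seen through ds and through cs
print(hex(physical(segment, 0x200)))  # 0x10200
print(hex(physical(cs, 0x0)))         # 0x10200
```

Both expressions name the same byte: adding 0x20 to a segment value moves the segment base forward by 0x20 * 16 = 0x200 bytes.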
After the jump to start_of_setup, it needs to do the following:
* Be sure that all values of all segment registers are equal
* Set up a correct stack if needed
* Set up bss
* Jump to C code at main.c
Let's look at the implementation.
Segment registers align
First of all it ensures that the ds and es segment registers point to the same address and disables interrupts with the cli instruction:
    movw %ds, %ax
    movw %ax, %es
    cli
As I wrote earlier, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020, because execution doesn't start from the start of the file, but from:
_start:
    .byte 0xeb
    .byte start_of_setup-1f
jump, which is at 512 bytes offset from the 4d 5a. It also needs to align cs from 0x1020 to 0x1000, as all other segment registers. This is done with a far return:
    pushw %ds
    pushw $6f
    lretw
It pushes the ds value and the address of label 6 onto the stack and executes the lretw instruction. When we call lretw, it loads the address of label 6 into the instruction pointer register and cs with the value of ds. After this we will have ds and cs with the same values.
Stack Setup
Actually, almost all of the setup code is preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:
    movw %ss, %dx
    cmpw %ax, %dx
    movw %sp, %dx
    je 2f
This can lead to 3 different scenarios:
* ss has the valid value 0x1000 (as all other segment registers beside cs)
* ss is invalid and the CAN_USE_HEAP flag is set (see below)
* ss is invalid and the CAN_USE_HEAP flag is not set (see below)
Let's look at all three of these scenarios:
1. ss has a correct value (0x1000). In this case we go to label 2:
2:  andw $~3, %dx
    jnz 3f
    movw $0xfffc, %dx
3:  movw %ax, %ss
    movzwl %dx, %esp
    sti
Here we can see the aligning of dx (which contains the sp given by the bootloader) to 4 bytes and a check whether it is zero. If it is zero, we put 0xfffc (the last 4-byte aligned address before the maximum segment size of 64 KB) in dx. If it is not zero, we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value into ss, which now stores the correct segment value of 0x1000, and set up a correct sp. We now have a correct stack:
2. In the second scenario, (ss != ds). First of all we put the _end (address of the end of the setup code) value in dx and check the loadflags header field with the testb instruction to see whether we can use the heap or not. loadflags is a bitmask header which is defined as:
#define LOADED_HIGH   (1<<0)
#define QUIET_FLAG    (1<<5)
#define KEEP_SEGMENTS (1<<6)
#define CAN_USE_HEAP  (1<<7)
And as we can read in the boot protocol:
Field name: loadflags
  This field is a bitmask.
  Bit 7 (write): CAN_USE_HEAP
    Set this bit to 1 to indicate that the value entered in the
    heap_end_ptr is valid. If this field is clear, some setup code
    functionality will be disabled.
If the CAN_USE_HEAP bit is set, we put heap_end_ptr in dx (which pointed to _end) and add STACK_SIZE (the minimal stack size, 512 bytes) to it. If this addition does not set the carry flag (it will not be set, dx = _end + 512), we jump to label 2 as in the previous case and make a correct stack.
3. When CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE:
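The three scenarios can be modelled in a few lines of Python. This is a sketch and not the kernel's code: the function choose_stack_pointer and its parameters are made up for illustration, the values in the usage lines are arbitrary, and resetting dx to zero when the addition overflows 16 bits is my assumption about how the wraparound case ends up at 0xfffc.

```python
CAN_USE_HEAP = 1 << 7
STACK_SIZE = 512  # minimal stack size mentioned above


def choose_stack_pointer(ss: int, ds: int, sp: int,
                         loadflags: int, heap_end_ptr: int, end: int) -> int:
    """Return the 16-bit stack pointer the setup code would end up with."""
    if ss == ds:
        dx = sp                      # scenario 1: trust the bootloader's sp
    else:
        dx = end                     # scenario 3 default: stack right after _end
        if loadflags & CAN_USE_HEAP:
            dx = heap_end_ptr        # scenario 2: stack above the heap
        dx += STACK_SIZE
        if dx > 0xFFFF:              # the 16-bit addition carried (assumption: reset to 0)
            dx = 0
    dx &= ~3 & 0xFFFF                # align down to 4 bytes
    if dx == 0:
        dx = 0xFFFC                  # never leave the stack pointer at zero
    return dx


print(hex(choose_stack_pointer(0x1000, 0x1000, 0xF7F4, 0, 0, 0x4000)))
print(hex(choose_stack_pointer(0x2000, 0x1000, 0x0, CAN_USE_HEAP, 0x5000, 0x4000)))
print(hex(choose_stack_pointer(0x2000, 0x1000, 0x0, 0, 0, 0x4000)))
```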
BSS Setup
The last two steps that need to happen before we can jump to the main C code are setting up the BSS area and checking the "magic" signature. First, signature checking:
    cmpl $0x5a5aaa55, setup_sig
    jne setup_bad
This simply compares setup_sig with the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.
If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:
    movw $__bss_start, %di
    movw $_end+3, %cx
    xorl %eax, %eax
    subw %di, %cx
    shrw $2, %cx
    rep; stosl
First of all the __bss_start address is moved into di and the _end + 3 address (+3 rounds up to 4 bytes) is moved into cx. The eax register is cleared (using an xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then cx is divided by four (the size of a double word), and the stosl instruction is repeatedly used, storing the value of eax (zero) into the address pointed to by di and automatically increasing di by four (this occurs until cx reaches zero). The net effect of this code is that zeros are written through all words in memory from __bss_start to _end.
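To see what the + 3 and the shift by 2 do to the count, here is a small Python model of the loop. The function clear_bss is made up for the example, and memory is just a bytearray standing in for the setup segment.

```python
def clear_bss(memory: bytearray, bss_start: int, end: int) -> int:
    """Zero memory from bss_start up to end, 4 bytes at a time.

    Returns the number of 4-byte stores, i.e. the value cx has before rep; stosl.
    """
    # cx = (_end + 3 - __bss_start) >> 2 : round the size up to whole double words
    count = ((end + 3 - bss_start) & 0xFFFF) >> 2
    di = bss_start
    for _ in range(count):
        memory[di:di + 4] = b"\x00\x00\x00\x00"  # stosl with eax = 0
        di += 4
    return count


memory = bytearray(b"\xff" * 64)
stores = clear_bss(memory, 16, 30)   # a 14-byte bss needs 4 double-word stores
print(stores, memory[16:32] == bytes(16))
```

With a 14-byte section the loop writes 16 bytes, so up to 3 bytes past _end may be cleared as well; that is the price of storing whole double words.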
Jump to main
That's all: we have the stack and BSS, so we can jump to the main() C function:
    calll main
The main() function is located in arch/x86/boot/main.c. You can read about what it does in the next part.
Conclusion
This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and much more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
* Intel 80386 programmer's reference manual 1986
* Minimal Boot Loader for Intel® Architecture
* 8086
* 80386
* Reset vector
* Real mode
* Linux kernel boot protocol
* CoreBoot developer manual
* Ralf Brown's Interrupt List
* Power supply
* Power good signal
Kernel booting process. Part 2.
First steps in the kernel setup
We started to dive into Linux kernel internals in the previous part and saw the initial part of the kernel setup code. We stopped at the first call to the main function (which is the first function written in C) from arch/x86/boot/main.c.
In this part we will continue to research the kernel setup code and see what protected mode is, some preparation for the transition into it, the heap and console initialization, memory detection, cpu validation, keyboard initialization and much much more.
So, let's go ahead.
Protected mode
Before we can move to the native Intel64 Long Mode, the kernel must switch the CPU into protected mode.
What is protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the 80286 processor until Intel 64 and long mode came.
The main reason to move away from Real mode is that there is very limited access to the RAM. As you may remember from the previous part, there are only 2^20 bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in Real mode.
Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs 1 Megabyte in real mode. Paging support was also added, which you can read about in the next sections.
Memory management in Protected mode is divided into two, almost independent parts:
* Segmentation
* Paging
Here we will only see segmentation. Paging will be discussed in the next sections.
As you can read in the previous part, addresses consist of two parts in real mode:
* Base address of the segment
* Offset from the segment base
And we can get the physical address if we know these two parts by:
PhysicalAddress = Segment * 16 + Offset
Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called the Segment Descriptor. The segment descriptors are stored in a data structure called the Global Descriptor Table (GDT).
The GDT is a structure which resides in memory. It has no fixed place in memory, so its address is stored in the special GDTR register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it into memory, something like:
lgdt gdt
where the lgdt instruction loads the base address and limit (size) of the global descriptor table into the GDTR register. GDTR is a 48-bit register and consists of two parts:
* size (16-bit) of the global descriptor table;
* address (32-bit) of the global descriptor table.
As mentioned above, the GDT contains segment descriptors which describe memory segments. Each descriptor is 64 bits in size. The general scheme of a descriptor is:
31          24        19      16               7           0
------------------------------------------------------------
|             | |B| |A|       | |   | |0|E|W|A|            |
| BASE 31:24  |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23:16 | 4
|             | |D| |L| 19:16 | |   | |1|C|R|A|            |
------------------------------------------------------------
|                             |                            |
|        BASE 15:0            |       LIMIT 15:0           | 0
|                             |                            |
------------------------------------------------------------
Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bits 0-15 of the Descriptor contain the value for the limit. The rest of it is in LIMIT 19:16. So, the size of Limit is 0-19, i.e. 20 bits. Let's take a closer look at it:
1. Limit [20 bits] is at bits 0-15 and 48-51 of the descriptor. It defines length_of_segment - 1. It depends on the G (Granularity) bit.
* if G (bit 55) is 0 and the segment limit is 0, the size of the segment is 1 Byte
* if G is 1 and the segment limit is 0, the size of the segment is 4096 Bytes
* if G is 0 and the segment limit is 0xfffff, the size of the segment is 1 Megabyte
* if G is 1 and the segment limit is 0xfffff, the size of the segment is 4 Gigabytes
So, it means that
* if G is 0, Limit is interpreted in terms of 1 Byte and the maximum size of the segment can be 1 Megabyte.
* if G is 1, Limit is interpreted in terms of 4096 Bytes = 4 KBytes = 1 Page and the maximum size of the segment can be 4 Gigabytes. Actually when G is 1, the value of Limit is shifted to the left by 12 bits. So, 20 bits + 12 bits = 32 bits and 2^32 = 4 Gigabytes.
2. Base [32 bits] is at bits 16-31, 32-39 and 56-63 of the descriptor. It defines the physical address of the segment's starting location.
3. Type/Attribute (bits 40-47) defines the type of segment and the kinds of access to it.
* The S flag at bit 44 specifies the descriptor type. If S is 0 then this segment is a system segment, whereas if S is 1 then this is a code or data segment (stack segments are data segments which must be read/write segments).
To determine if the segment is a code or data segment we can check its Ex (bit 43) attribute, marked as 0 in the above diagram. If it is 0, then the segment is a Data segment, otherwise it is a code segment.
A segment can be of one of the following types:
|           Type Field        | Descriptor Type | Description
|-----------------------------|-----------------|------------------
| Decimal                     |                 |
|             0    E    W   A |                 |
| 0           0    0    0   0 | Data            | Read-Only
| 1           0    0    0   1 | Data            | Read-Only, accessed
| 2           0    0    1   0 | Data            | Read/Write
| 3           0    0    1   1 | Data            | Read/Write, accessed
| 4           0    1    0   0 | Data            | Read-Only, expand-down
| 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
| 6           0    1    1   0 | Data            | Read/Write, expand-down
| 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
|                  C    R   A |                 |
| 8           1    0    0   0 | Code            | Execute-Only
| 9           1    0    0   1 | Code            | Execute-Only, accessed
| 10          1    0    1   0 | Code            | Execute/Read
| 11          1    0    1   1 | Code            | Execute/Read, accessed
| 12          1    1    0   0 | Code            | Execute-Only, conforming
| 13          1    1    0   1 | Code            | Execute-Only, conforming, accessed
| 14          1    1    1   0 | Code            | Execute/Read, conforming
| 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed
As we can see the first bit (bit 43) is 0 for a data segment and 1 for a code segment. The next three bits (42, 41, 40) are either EWA (Expansion Writable Accessible) or CRA (Conforming Readable Accessible).
* if E (bit 42) is 0, expand up, otherwise expand down.
* if W (bit 41) (for data segments) is 1, write access is allowed, otherwise not. Note that read access is always allowed on data segments.
* A (bit 40) - whether the segment has been accessed by the processor or not.
* C (bit 42) is the conforming bit (for code selectors). If C is 1, the segment code can be executed from a lower privilege level, e.g. user level. If C is 0, it can only be executed from the same privilege level.
* R (bit 41) (for code segments). If 1, read access to the segment is allowed, otherwise not. Write access is never allowed to code segments.
4. DPL [2 bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3, where 0 is the most privileged.
5. P flag (bit 47) - indicates if the segment is present in memory or not. If P is 0, the segment will be treated as invalid and the processor will refuse to read this segment.
6. AVL flag (bit 52) - Available and reserved bits. It is ignored in Linux.
7. L flag (bit 53) - indicates whether a code segment contains native 64-bit code. If 1, then the code segment executes in 64-bit mode.
8. D/B flag (bit 54) - Default/Big flag, represents the operand size, i.e. 16/32 bits. If it is set then 32 bit, otherwise 16.
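All these bit positions are easier to keep in mind after decoding one descriptor by hand. Below is a Python sketch that pulls the fields out of a 64-bit value. The names SegmentDescriptor and decode_descriptor are made up for this example, and the value 0x00cf9a000000ffff is just a sample I picked (a flat 4 Gigabyte, execute/read code segment), not something taken from the kernel sources discussed here.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int         # bits 16-39 and 56-63
    limit: int        # bits 0-15 and 48-51
    seg_type: int     # bits 40-43
    s: int            # bit 44
    dpl: int          # bits 45-46
    present: int      # bit 47
    avl: int          # bit 52
    long_mode: int    # bit 53 (L)
    db: int           # bit 54 (D/B)
    granularity: int  # bit 55 (G)

    def size_in_bytes(self) -> int:
        """Limit is length - 1, counted in bytes (G = 0) or 4096-byte pages (G = 1)."""
        unit = 4096 if self.granularity else 1
        return (self.limit + 1) * unit


def decode_descriptor(raw: int) -> SegmentDescriptor:
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)
    return SegmentDescriptor(
        base=base,
        limit=limit,
        seg_type=(raw >> 40) & 0xF,
        s=(raw >> 44) & 1,
        dpl=(raw >> 45) & 3,
        present=(raw >> 47) & 1,
        avl=(raw >> 52) & 1,
        long_mode=(raw >> 53) & 1,
        db=(raw >> 54) & 1,
        granularity=(raw >> 55) & 1,
    )


desc = decode_descriptor(0x00CF9A000000FFFF)  # sample value, see the note above
print(desc)
print(desc.size_in_bytes() == 2 ** 32)
```

For the sample value the type field comes out as 10 (Execute/Read in the table above), S is 1, DPL is 0 and G is 1, so the 0xfffff limit covers the whole 4 Gigabytes.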
Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure - the Segment Selector. Each Segment Descriptor has an associated Segment Selector. A Segment Selector is a 16-bit structure:
-----------------------------
|       Index    | TI | RPL |
-----------------------------
Where,
* Index shows the index number of the descriptor in the GDT.
* TI (Table Indicator) shows where to search for the descriptor. If it is 0 then search in the Global Descriptor Table (GDT), otherwise it will look in the Local Descriptor Table (LDT).
* And RPL is the Requester's Privilege Level.
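Splitting a selector into these three fields is just a couple of shifts and masks. A small sketch (decode_selector is a name made up for the example, and the selector values are arbitrary samples):

```python
def decode_selector(selector: int) -> tuple:
    """Split a 16-bit segment selector into (index, TI, RPL)."""
    index = (selector >> 3) & 0x1FFF  # upper 13 bits: descriptor index in the table
    ti = (selector >> 2) & 1          # 0 = GDT, 1 = LDT
    rpl = selector & 3                # requested privilege level, 0-3
    return index, ti, rpl


print(decode_selector(0x10))  # (2, 0, 0)
print(decode_selector(0x2B))  # (5, 0, 3)
```

Since the low three bits are taken by TI and RPL, index * 8 is the byte offset of the descriptor inside the table, which fits the 64-bit (8 byte) descriptor size.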
Every segment register has a visible and a hidden part.
* Visible - the Segment Selector is stored here
* Hidden - the Segment Descriptor (base, limit, attributes, flags)
The following steps are needed to get the physical address in protected mode:
* The segment selector must be loaded in one of the segment registers
* The CPU tries to find a segment descriptor by GDT address + Index from the selector and loads the descriptor into the hidden part of the segment register
* Base address (from the segment descriptor) + offset will be the linear address of the segment, which is the physical address (if paging is disabled).
Schematically it will look like this:
The algorithm for the transition from real mode into protected mode is:
* Disable interrupts
* Describe and load the GDT with the lgdt instruction
* Set the PE (Protection Enable) bit in CR0 (Control Register 0)
* Jump to protected mode code
We will see the complete transition to protected mode in the Linux kernel in the next part, but before we can move to protected mode, we need to do some more preparations.
Let's look at arch/x86/boot/main.c. We can see some routines there which perform keyboard initialization, heap initialization, etc. Let's take a look.
Copying boot parameters into the "zero page"
We will start from the main routine in "main.c". The first function which is called in main is copy_boot_params(void). It copies the kernel setup header into the field of the boot_params structure which is defined in arch/x86/include/uapi/asm/bootparam.h.
The boot_params structure contains the struct setup_header hdr field. This structure contains the same fields as defined in the Linux boot protocol and is filled by the bootloader and also at kernel compile/build time. copy_boot_params does two things:
1. Copies hdr from header.S to the boot_params structure in the setup_header field
2. Updates the pointer to the kernel command line if the kernel was loaded with the old command line protocol.
Note that it copies hdr with the memcpy function which is defined in the copy.S source file. Let's have a look inside:
GLOBAL(memcpy)
    pushw %si
    pushw %di
    movw %ax, %di
    movw %dx, %si
    pushw %cx
    shrw $2, %cx
    rep; movsl
    popw %cx
    andw $3, %cx
    rep; movsb
    popw %di
    popw %si
    retl
ENDPROC(memcpy)
Yeah, we just moved to C code and now assembly again :) First of all we can see that memcpy and other routines which are defined here start and end with the two macros: GLOBAL and ENDPROC. GLOBAL is described in arch/x86/include/asm/linkage.h, which defines the globl directive and the label for it. ENDPROC is described in include/linux/linkage.h, which marks the name symbol as a function name and ends with the size of the name symbol.
The implementation of memcpy is easy. At first, it pushes the values from the si and di registers onto the stack to preserve them, because their values will change during the memcpy. memcpy (and other functions in copy.S) use fastcall calling conventions. So it gets its incoming parameters from the ax, dx and cx registers. Calling memcpy looks like this:
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
So,
* ax will contain the address of boot_params.hdr
* dx will contain the address of hdr
* cx will contain the size of hdr in bytes.
memcpy puts the address of boot_params.hdr into di and the address of hdr into si, and saves the size on the stack. After this it shifts the size right by 2 (divides it by 4) and copies from si to di 4 bytes at a time. After this it restores the size of hdr again, masks it to the remaining 0-3 bytes and copies the rest of the bytes from si to di byte by byte (if there are any). It restores the si and di values from the stack at the end, and after this the copying is finished.
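The "4 bytes at a time, then the leftovers" idea looks like this in Python. It is only a model of the description above: copy_dwords_then_bytes is a made-up name, and bytearray slices stand in for what rep; movsl and rep; movsb do with si and di.

```python
def copy_dwords_then_bytes(dst: bytearray, src: bytes, size: int) -> tuple:
    """Copy size bytes from src to dst: size >> 2 double words, then size & 3 bytes.

    Returns (number of 4-byte copies, number of 1-byte copies).
    """
    pos = 0
    dwords = size >> 2              # shrw $2, %cx
    for _ in range(dwords):         # rep; movsl
        dst[pos:pos + 4] = src[pos:pos + 4]
        pos += 4
    tail = size & 3                 # andw $3, %cx
    for _ in range(tail):           # rep; movsb
        dst[pos] = src[pos]
        pos += 1
    return dwords, tail


src = bytes(range(11))
dst = bytearray(16)
print(copy_dwords_then_bytes(dst, src, len(src)))  # (2, 3)
print(bytes(dst[:11]) == src)
```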
Console initialization
After the hdr is copied into boot_params.hdr, the next step is console initialization by calling the console_init function which is defined in arch/x86/boot/early_serial_console.c.
It tries to find the earlyprintk option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of the earlyprintk command line option can be one of these:
* serial,0x3f8,115200
* serial,ttyS0,115200
* ttyS0,115200
After serial port initialization we can see the first output:
if (cmdline_find_option_bool("debug"))
    puts("early console in setup code\n");
The definition of puts is in tty.c. As we can see it prints character by character in a loop by calling the putchar function. Let's look into the putchar implementation:
void __attribute__((section(".inittext"))) putchar(int ch)
{
    if (ch == '\n')
        putchar('\r');
    bios_putchar(ch);
    if (early_serial_base != 0)
        serial_putchar(ch);
}
__attribute__((section(".inittext"))) means that this code will be in the .inittext section. We can find it in the linker file setup.ld.
First of all, putchar checks for the \n symbol and if it is found, prints \r before it. After that it outputs the character on the VGA screen by calling the BIOS with the 0x10 interrupt call:
static void __attribute__((section(".inittext"))) bios_putchar(int ch)
{
    struct biosregs ireg;
    initregs(&ireg);
    ireg.bx = 0x0007;
    ireg.cx = 0x0001;
    ireg.ah = 0x0e;
    ireg.al = ch;
    intcall(0x10, &ireg, NULL);
}
Here initregs takes the biosregs structure and first fills biosregs with zeros using the memset function and then fills it with register values.
memset(reg, 0, sizeof *reg);
reg->eflags |= X86_EFLAGS_CF;
reg->ds = ds();
reg->es = ds();
reg->fs = fs();
reg->gs = gs();
Let's look at the memset implementation:
GLOBAL(memset)
    pushw %di
    movw %ax, %di
    movzbl %dl, %eax
    imull $0x01010101, %eax
    pushw %cx
    shrw $2, %cx
    rep; stosl
    popw %cx
    andw $3, %cx
    rep; stosb
    popw %di
    retl
ENDPROC(memset)
As you can read above, it uses the fastcall calling conventions like the memcpy function, which means that the function gets its parameters from the ax, dx and cx registers.
Generally memset is like the memcpy implementation. It saves the value of the di register on the stack and puts the ax value into di, which is the address of the biosregs structure. Next is the movzbl instruction, which copies the dl value to the low byte of the eax register. The remaining 3 high bytes of eax will be filled with zeros.
The next instruction multiplies eax by 0x01010101. It needs to because memset will copy 4 bytes at the same time. For example, say we need to fill a structure with 0x7 with memset. eax will contain the value 0x00000007 in this case. So if we multiply eax by 0x01010101, we will get 0x07070707 and now we can copy these 4 bytes into the structure. memset uses the rep; stosl instructions for copying eax into es:di.
The rest of the memset function does almost the same as memcpy.
After the biosregs structure is filled with memset, bios_putchar calls the 0x10 interrupt which prints a character. Afterwards it checks whether the serial port was initialized or not and, if it was, writes a character there with serial_putchar and the inb/outb instructions.
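The multiplication trick from memset can be checked quickly in Python (broadcast_byte is just a name made up for this check):

```python
def broadcast_byte(value: int) -> int:
    """Replicate the low byte of value into all four bytes of a 32-bit word."""
    low_byte = value & 0xFF                        # movzbl %dl, %eax
    return (low_byte * 0x01010101) & 0xFFFFFFFF    # imull $0x01010101, %eax


print(hex(broadcast_byte(0x7)))   # 0x7070707
print(hex(broadcast_byte(0xAB)))  # 0xabababab
```

It works because 0x01010101 is 1 + 2^8 + 2^16 + 2^24, so the product is the byte added to itself at four different byte positions, and a single byte never overflows into its neighbour.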
Heap initialization
After the stack and bss section were prepared in header.S (see the previous part), the kernel needs to initialize the heap with the init_heap function.
First of all init_heap checks the CAN_USE_HEAP flag from the loadflags in the kernel setup header and calculates the end of the stack if this flag was set:
char *stack_end;

if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
    asm("leal %P1(%%esp),%0"
        : "=r" (stack_end) : "i" (-STACK_SIZE));
or in other words stack_end = esp - STACK_SIZE.
Then there is the heap_end calculation:
heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
which means heap_end_ptr or _end + 512 (0x200). Finally, it checks whether heap_end is greater than stack_end. If it is, heap_end is set to stack_end so that the heap cannot run into the stack.
Now the heap is initialized and we can use it with the GET_HEAP method. We will see how it is used and how it is implemented in the next posts.
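The arithmetic of init_heap can be sketched as a plain C function. This is only an illustration under stated assumptions: compute_heap_end is a hypothetical helper, addresses are passed in as integers instead of being read from esp and boot_params, and SKETCH_STACK_SIZE is an assumed value, not necessarily the kernel's STACK_SIZE.

```c
#include <stdint.h>

#define SKETCH_STACK_SIZE 512u  /* assumed value, for illustration only */

/* Hypothetical model of init_heap: the heap ends 0x200 bytes past
 * heap_end_ptr, but never above the lowest address reserved for the stack. */
static uintptr_t compute_heap_end(uintptr_t esp, uintptr_t heap_end_ptr)
{
    uintptr_t stack_end = esp - SKETCH_STACK_SIZE;
    uintptr_t heap_end = heap_end_ptr + 0x200;

    if (heap_end > stack_end)
        heap_end = stack_end;   /* clamp so heap and stack do not overlap */

    return heap_end;
}
```

With esp = 0x10000, a heap_end_ptr of 0x9000 gives 0x9200, while a heap_end_ptr of 0xfe00 would overshoot and is clamped to 0xfe00.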
CPU validation
The next step, as we can see, is CPU validation by validate_cpu from arch/x86/boot/cpu.c.
It calls the check_cpu function, passes the CPU level and the required CPU level to it, and checks that the kernel is launched on the right CPU level.
check_cpu(&cpu_level, &req_level, &err_flags);
if (cpu_level < req_level) {
    ...
    return -1;
}
check_cpu checks the CPU's flags, the presence of long mode in the case of an x86_64 (64-bit) CPU, checks the processor's vendor and makes preparations for certain vendors, like turning on SSE+SSE2 for AMD if they are missing, etc.
Memory detection
The next step is memory detection by the detect_memory function. detect_memory basically provides a map of available RAM to the CPU. It uses different programming interfaces for memory detection, like 0xe820, 0xe801 and 0x88. We will see only the implementation of 0xE820 here.
Let's look into the detect_memory_e820 implementation from the arch/x86/boot/memory.c source file. First of all, the detect_memory_e820 function initializes the biosregs structure as we saw above and fills registers with special values for the 0xe820 call:
initregs(&ireg);
ireg.ax  = 0xe820;
ireg.cx  = sizeof buf;
ireg.edx = SMAP;
ireg.di  = (size_t)&buf;
* ax contains the number of the function (0xe820 in our case)
* cx contains the size of the buffer which will contain data about memory
* edx must contain the SMAP magic number
* es:di must contain the address of the buffer which will contain memory data
* ebx has to be zero.
Next is a loop where data about the memory will be collected. It starts with a call of the 0x15 BIOS interrupt, which writes one line from the address allocation table. To get the next line we need to call this interrupt again (which we do in the loop). Before the next call ebx must contain the value returned previously:
intcall(0x15, &ireg, &oreg);
ireg.ebx = oreg.ebx;
Ultimately, it iterates in the loop to collect data from the address allocation table and writes this data into the e820entry array:
* start of memory segment
* size of memory segment
* type of memory segment (which can be reserved, usable, etc.).
You can see the result of this in the dmesg output, something like:
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
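The continuation protocol (call, save the returned ebx, call again until ebx is zero) may be easier to see in a user-space simulation. Everything below is an assumption made for illustration: fake_e820_call stands in for the 0x15 BIOS interrupt, fake_table is an invented three-entry map loosely based on the dmesg output above, and struct sketch_e820_entry is a simplified stand-in for the kernel's e820entry.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_e820_entry {
    uint64_t addr;  /* start of memory segment */
    uint64_t size;  /* size of memory segment */
    uint32_t type;  /* 1 = usable, 2 = reserved */
};

/* Invented firmware table, for illustration only. */
static const struct sketch_e820_entry fake_table[] = {
    { 0x0000000000000000ULL, 0x0009fc00ULL, 1 },
    { 0x000000000009fc00ULL, 0x00000400ULL, 2 },
    { 0x0000000000100000ULL, 0x3fee0000ULL, 1 },
};
#define FAKE_COUNT (sizeof(fake_table) / sizeof(fake_table[0]))

/* Stand-in for "int 0x15, ax=0xe820": copies one entry and updates *ebx
 * to the continuation value (0 after the last entry). Returns -1 on error,
 * which models the carry flag being set. */
static int fake_e820_call(uint32_t *ebx, struct sketch_e820_entry *out)
{
    if (*ebx >= FAKE_COUNT)
        return -1;
    *out = fake_table[*ebx];
    *ebx = (*ebx + 1 < FAKE_COUNT) ? *ebx + 1 : 0;
    return 0;
}

/* Collect entries until the firmware reports the end (ebx == 0),
 * an error occurs, or the caller's array is full. */
static size_t collect_e820(struct sketch_e820_entry *map, size_t max)
{
    size_t count = 0;
    uint32_t ebx = 0;           /* ebx has to be zero for the first call */

    do {
        struct sketch_e820_entry buf;

        if (fake_e820_call(&ebx, &buf) != 0)
            break;
        map[count++] = buf;
    } while (ebx != 0 && count < max);

    return count;
}
```

Like the loop described above, this is a do/while: the first call is always made, and the returned continuation value decides whether another one follows.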
Keyboard initialization
The next step is the initialization of the keyboard with a call of the keyboard_init() function. At first keyboard_init initializes registers using the initregs function and calls the 0x16 interrupt to get the keyboard status.
initregs(&ireg);
ireg.ah = 0x02;     /* Get keyboard status */
intcall(0x16, &ireg, &oreg);
boot_params.kbd_status = oreg.al;
After this it calls 0x16 again to set the repeat rate and delay.
ireg.ax = 0x0305;   /* Set keyboard repeat rate */
intcall(0x16, &ireg, NULL);
Querying
The next couple of steps are queries for different parameters. We will not dive into details about these queries, but will get back to them in later parts. Let's take a short look at these functions:
The query_mca routine calls the 0x15 BIOS interrupt to get the machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes:
int query_mca(void)
{
    struct biosregs ireg, oreg;
    u16 len;

    initregs(&ireg);
    ireg.ah = 0xc0;
    intcall(0x15, &ireg, &oreg);

    if (oreg.eflags & X86_EFLAGS_CF)
        return -1;  /* No MCA present */

    set_fs(oreg.es);
    len = rdfs16(oreg.bx);

    if (len > sizeof(boot_params.sys_desc_table))
        len = sizeof(boot_params.sys_desc_table);

    copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
    return 0;
}
It fills the ah register with 0xc0 and calls the 0x15 BIOS interrupt. After the interrupt executes it checks the carry flag, and if it is set to 1, the BIOS doesn't support MCA (https://en.wikipedia.org/wiki/Micro_Channel_architecture). If the carry flag is set to 0, ES:BX will contain a pointer to the system information table, which looks like this:
Offset  Size      Description
 00h    WORD      number of bytes following
 02h    BYTE      model (see #00515)
 03h    BYTE      submodel (see #00515)
 04h    BYTE      BIOS revision: 0 for first release, 1 for 2nd, etc.
 05h    BYTE      feature byte 1 (see #00510)
 06h    BYTE      feature byte 2 (see #00511)
 07h    BYTE      feature byte 3 (see #00512)
 08h    BYTE      feature byte 4 (see #00513)
 09h    BYTE      feature byte 5 (see #00514)
---AWARD BIOS---
 0Ah    N BYTEs   AWARD copyright notice
---Phoenix BIOS---
 0Ah    BYTE      ??? (00h)
 0Bh    BYTE      major version
 0Ch    BYTE      minor version (BCD)
 0Dh    4 BYTEs   ASCIZ string "PTL" (Phoenix Technologies Ltd)
---Quadram Quad386---
 0Ah    17 BYTEs  ASCII signature string "Quadram Quad386XT"
---Toshiba (Satellite Pro 435CDS at least)---
 0Ah    7 BYTEs   signature "TOSHIBA"
 11h    BYTE      ??? (8h)
 12h    BYTE      ??? (E7h) product ID??? (guess)
 13h    3 BYTEs   "JPN"
Next we call the set_fs routine and pass the value of the es register to it. The implementation of set_fs is pretty simple:
static inline void set_fs(u16 seg)
{
    asm volatile("movw %0,%%fs" : : "rm" (seg));
}
This function contains inline assembly which gets the value of the seg parameter and puts it into the fs register. There are many functions in boot.h like set_fs, for example set_gs, and fs and gs for reading the value of the corresponding register, etc.
At the end of query_mca it just copies the table pointed to by es:bx to boot_params.sys_desc_table.
The next step is getting Intel SpeedStep information by calling the query_ist function. First of all it checks the CPU level and, if it is correct, calls 0x15 to get the info and saves the result to boot_params.
The following query_apm_bios function gets Advanced Power Management information from the BIOS. query_apm_bios calls the 0x15 BIOS interrupt too, but with ah = 0x53 to check the APM installation. After the 0x15 execution, the query_apm_bios function checks the PM signature (it must be 0x504d), the carry flag (it must be 0 if APM is supported) and the value of the cx register (if it's 0x02, the protected mode interface is supported).
Next it calls 0x15 again, but with ax = 0x5304 for disconnecting the APM interface and connecting the 32-bit protected mode interface. In the end it fills boot_params.apm_bios_info with values obtained from the BIOS.
Note that query_apm_bios will be executed only if CONFIG_APM or CONFIG_APM_MODULE was set in the configuration file:
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif
The last is the query_edd function, which queries Enhanced Disk Drive information from the BIOS. Let's look into the query_edd implementation.
First of all it reads the edd option from the kernel's command line, and if it was set to off then query_edd just returns.
If EDD is enabled, query_edd goes over BIOS-supported hard disks and queries EDD information in the following loop:
for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
    if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
        memcpy(edp, &ei, sizeof ei);
        edp++;
        boot_params.eddbuf_entries++;
    }
    ...
    ...
    ...
where 0x80 is the first hard drive and the value of the EDD_MBR_SIG_MAX macro is 16. It collects data into the array of edd_info structures. get_edd_info checks that EDD is present by invoking the 0x13 interrupt with ah as 0x41, and if EDD is present, get_edd_info again calls the 0x13 interrupt, but with ah as 0x48 and si containing the address of the buffer where EDD information will be stored.
Conclusion
This is the end of the second part about Linux kernel internals. In the next part we will see video mode setting and the rest of the preparations before the transition to protected mode, as well as the transition itself.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
* Protected mode
* Protected mode
* Long mode
* Nice explanation of CPU Modes with code
* How to Use Expand Down Segments on Intel 386 and Later CPUs
* earlyprintk documentation
* Kernel Parameters
* Serial console
* Intel SpeedStep
* APM
* EDD specification
* TLDP documentation for Linux Boot Process (old)
* Previous Part
Kernel booting process. Part 3.
Video mode initialization and transition to protected mode
This is the third part of the Kernel booting process series. In the previous part, we stopped right before the call of the set_video routine from main.c. In this part, we will see:
* video mode initialization in the kernel setup code,
* preparation before switching into protected mode,
* transition to protected mode
NOTE If you don't know anything about protected mode, you can find some information about it in the previous part. Also there are a couple of links which can help you.
As I wrote above, we will start from the set_video function, which is defined in the arch/x86/boot/video.c source code file. We can see that it starts by first getting the video mode from the boot_params.hdr structure:
u16 mode = boot_params.hdr.vid_mode;
which we filled in the copy_boot_params function (you can read about it in the previous post). vid_mode is an obligatory field which is filled by the bootloader. You can find information about it in the kernel boot protocol:
Offset  Proto   Name        Meaning
/Size
01FA/2  ALL     vid_mode    Video mode control
As we can read from the linux kernel boot protocol:
vga=<mode>
    <mode> here is either an integer (in C notation, either
    decimal, octal, or hexadecimal) or one of the strings
    "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
    (meaning 0xFFFD).  This value should be entered into the
    vid_mode field, as it is used by the kernel before the command
    line is parsed.
So we can add the vga option to the grub or another bootloader configuration file, and it will pass this option to the kernel command line. This option can have different values, as mentioned in the description; for example, it can be an integer number, 0xFFFD, or ask. If you pass ask to vga, you will see a menu which asks you to select a video mode. We will look at its implementation, but before diving into the implementation we have to look at some other things.
Kernel data types
Earlier we saw definitions of different data types like u16 etc. in the kernel setup code. Let's look at a couple of data types provided by the kernel:
Type  char  short  int  long  u8  u16  u32  u64
Size  1     2      4    8     1   2    4    8
If you read the source code of the kernel, you'll see these very often, so it will be good to remember them.
Heap API
After we get vid_mode from boot_params.hdr in the set_video function, we can see the call to the RESET_HEAP function. RESET_HEAP is a macro which is defined in boot.h. It is defined as:
#define RESET_HEAP() ((void *)( HEAP = _end ))
If you have read the second part, you will remember that we initialized the heap with the init_heap function. We have a couple of utility functions for the heap which are defined in boot.h. They are:
#define RESET_HEAP()
As we saw just above, it resets the heap by setting the HEAP variable equal to _end, where _end is just extern char _end[];
Next is the GET_HEAP macro:
#define GET_HEAP(type, n) \
    ((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
for heap allocation. It calls the internal function __get_heap with 3 parameters:
* the size of a type in bytes, which needs to be allocated
* __alignof__(type) shows how variables of this type are aligned
* n tells how many items to allocate
The implementation of __get_heap is:
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
    char *tmp;

    HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
    tmp = HEAP;
    HEAP += s*n;
    return tmp;
}
and further on we will see its usage, something like:
saved.data = GET_HEAP(u16, saved.x * saved.y);
Let's try to understand how __get_heap works. We can see here that HEAP (which is equal to _end after RESET_HEAP()) is first rounded up to an address aligned according to the a parameter. After this we save the memory address from HEAP into the tmp variable, move HEAP to the end of the allocated block and return tmp, which is the start address of the allocated memory.
And the last function is:
static inline bool heap_free(size_t n)
{
    return (int)(heap_end - HEAP) >= (int)n;
}
which subtracts the value of HEAP from heap_end (we calculated it in the previous part) and returns 1 if there is enough memory for n.
That's all. Now we have a simple API for the heap and can set up the video mode.
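The three pieces of this API together form a classic bump allocator. The following user-space sketch mirrors that idea under clearly stated assumptions: a fixed 256-byte arena stands in for the memory between _end and heap_end, and all sketch_ names are hypothetical. As in the code described above, allocation itself does no bounds check, so callers are expected to ask sketch_heap_free first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the memory between _end and heap_end. */
static _Alignas(16) char arena[256];
static char *heap = arena;
static char *const heap_limit = arena + sizeof(arena);

/* Like RESET_HEAP(): forget every previous allocation. */
static void sketch_reset_heap(void)
{
    heap = arena;
}

/* Like heap_free(): is there room for n more bytes? */
static int sketch_heap_free(size_t n)
{
    return (size_t)(heap_limit - heap) >= n;
}

/* Like __get_heap(): round the heap pointer up to alignment a,
 * then reserve n items of s bytes each. */
static void *sketch_get_heap(size_t s, size_t a, size_t n)
{
    char *tmp;

    assert(a != 0 && (a & (a - 1)) == 0);   /* a must be a power of two */
    heap = (char *)(((uintptr_t)heap + (a - 1)) & ~(uintptr_t)(a - 1));
    tmp = heap;
    heap += s * n;
    return tmp;
}
```

For example, after allocating three single bytes the heap pointer sits at offset 3; a following request with 4-byte alignment is rounded up and starts at offset 4.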
Set up video mode
Now we can move directly to video mode initialization. We stopped at the RESET_HEAP() call in the set_video function. Next is the call to store_mode_params, which stores video mode parameters in the boot_params.screen_info structure, which is defined in include/uapi/linux/screen_info.h.
If we look at the store_mode_params function, we can see that it starts with a call to the store_cursor_position function. As you can understand from the function name, it gets information about the cursor and stores it.
First of all store_cursor_position initializes two variables of type biosregs, with AH = 0x3, and calls the 0x10 BIOS interrupt. After the interrupt has successfully executed, it returns the column and row in the DL and DH registers. The column and row will be stored in the orig_x and orig_y fields of the boot_params.screen_info structure.
After store_cursor_position has executed, the store_video_mode function is called. It just gets the current video mode and stores it in boot_params.screen_info.orig_video_mode.
After this, it checks the current video mode and sets the video_segment. After the BIOS transfers control to the boot sector, the following addresses are for video memory:
0xB000:0x0000   32 Kb   Monochrome Text Video Memory
0xB800:0x0000   32 Kb   Color Text Video Memory
So we set the video_segment variable to 0xB000 if the current video mode is MDA, HGC, or VGA in monochrome mode, and to 0xB800 if the current video mode is a color mode. After setting up the address of the video segment, the font size needs to be stored in boot_params.screen_info.orig_video_points with:
set_fs(0);
font_size = rdfs16(0x485);
boot_params.screen_info.orig_video_points = font_size;
First of all we put 0 into the FS register with the set_fs function. We already saw functions like set_fs in the previous part. They are all defined in boot.h. Next we read the value which is located at address 0x485 (this memory location is used to get the font size) and save the font size in boot_params.screen_info.orig_video_points.
x = rdfs16(0x44a);
y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;
Next we get the number of columns from address 0x44a and the number of rows from address 0x484 and store them in boot_params.screen_info.orig_video_cols and boot_params.screen_info.orig_video_lines. After this, execution of store_mode_params is finished.
Next we can see the save_screen function, which just saves the screen contents to the heap. This function collects all the data which we got in the previous functions, like the number of rows and columns, and stores it in the saved_screen structure, which is defined as:
static struct saved_screen {
    int x, y;
    int curx, cury;
    u16 *data;
} saved;
It then checks whether the heap has free space for it with:
if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
    return;
and, if there is enough space, allocates it on the heap and stores saved_screen in it.
The next call is probe_cards(0) from arch/x86/boot/video-mode.c. It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting part: we can see the loop:
for (card = video_cards; card < video_cards_end; card++) {
    /* collecting number of modes here */
}
but video_cards is not declared anywhere. The answer is simple: every video mode presented in the x86 kernel setup code has a definition like this:
static __videocard video_vga = {
    .card_name  = "VGA",
    .probe      = vga_probe,
    .set_mode   = vga_set_mode,
};
where __videocard is a macro:
#define __videocard struct card_info __attribute__((used,section(".videocards")))
which means that the card_info structure:
struct card_info {
    const char *card_name;
    int (*set_mode)(struct mode_info *mode);
    int (*probe)(void);
    struct mode_info *modes;
    int nmodes;
    int unsafe;
    u16 xmode_first;
    u16 xmode_n;
};
is in the .videocards segment. Let's look in the arch/x86/boot/setup.ld linker file, where we can see:
.videocards : {
    video_cards = .;
    *(.videocards)
    video_cards_end = .;
}
It means that video_cards is just a memory address and all card_info structures are placed in this segment, between video_cards and video_cards_end, so we can use a loop to go over all of them. After probe_cards has executed, we have all structures like static __videocard video_vga with nmodes (the number of video modes) filled in.
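The probing loop itself can be illustrated without any linker tricks. In the sketch below an ordinary C array takes the place of the .videocards section, which is a deliberate simplification and not how the setup code collects its cards. The structure, the two probe functions and the mode counts they return are all invented for this example.

```c
#include <stddef.h>

/* Simplified, hypothetical version of card_info. */
struct sketch_card {
    const char *card_name;
    int (*probe)(void);     /* returns the number of modes found */
    int nmodes;
};

static int sketch_probe_vga(void)  { return 7; }   /* invented count */
static int sketch_probe_vesa(void) { return 0; }   /* invented count */

/* A plain array plays the role of the .videocards section. */
static struct sketch_card sketch_cards[] = {
    { "VGA",  sketch_probe_vga,  0 },
    { "VESA", sketch_probe_vesa, 0 },
};
static struct sketch_card *const sketch_cards_end =
    sketch_cards + sizeof(sketch_cards) / sizeof(sketch_cards[0]);

/* Walk from the first card to the end marker, as the loop over
 * video_cards .. video_cards_end does, and record the mode counts. */
static int sketch_probe_cards(void)
{
    int total = 0;

    for (struct sketch_card *card = sketch_cards; card < sketch_cards_end; card++) {
        card->nmodes = card->probe();
        total += card->nmodes;
    }
    return total;
}
```

The advantage of the section-based approach over such an array is that each driver file can register its own card without any central list having to be edited.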
After probe_cards has finished, we move to the main loop in the set_video function. There is an infinite loop which tries to set up the video mode with the set_mode function, or prints a menu if we passed vid_mode=ask to the kernel command line or if the video mode is undefined.
The set_mode function is defined in video-mode.c and gets only one parameter, mode, which is the number of the video mode (we got it either from the menu or at the start of set_video, from the kernel setup header).
The set_mode function checks the mode and calls the raw_set_mode function. raw_set_mode calls the set_mode function of the selected card, i.e. card->set_mode(struct mode_info*). We can get access to this function from the card_info structure; every video mode defines this structure with values filled in depending on the video mode (for example, for vga it is the video_vga.set_mode function; see the example of the card_info structure for vga above). video_vga.set_mode is vga_set_mode, which checks the vga mode and calls the respective function:
static int vga_set_mode(struct mode_info *mode)
{
    vga_set_basic_mode();

    force_x = mode->x;
    force_y = mode->y;

    switch (mode->mode) {
    case VIDEO_80x25:
        break;
    case VIDEO_8POINT:
        vga_set_8font();
        break;
    case VIDEO_80x43:
        vga_set_80x43();
        break;
    case VIDEO_80x28:
        vga_set_14font();
        break;
    case VIDEO_80x30:
        vga_set_80x30();
        break;
    case VIDEO_80x34:
        vga_set_80x34();
        break;
    case VIDEO_80x60:
        vga_set_80x60();
        break;
    }
    return 0;
}
Every function which sets up a video mode just calls the 0x10 BIOS interrupt with a certain value in the AH register.
After we have set the video mode, we pass it to boot_params.hdr.vid_mode.
Next vesa_store_edid is called. This function simply stores the EDID (Extended Display Identification Data) information for kernel use. After this store_mode_params is called again. Lastly, if do_restore is set, the screen is restored to an earlier state.
After this we have set the video mode and now we can switch to protected mode.
Last preparation before transition into protected mode
We can see the last function call, go_to_protected_mode, in main.c. As the comment says: Do the last things and invoke protected mode, so let's see these last things and switch into protected mode.
go_to_protected_mode is defined in arch/x86/boot/pm.c. It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what they do and how it works.
First is the call to the realmode_switch_hook function in go_to_protected_mode. This function invokes the real mode switch hook if it is present and disables NMI. Hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the boot protocol (see ADVANCED BOOT LOADER HOOKS).
The realmode_switch hook presents a pointer to the 16-bit real mode far subroutine which disables non-maskable interrupts. After the realmode_switch hook (it isn't present for me) is checked, disabling of Non-Maskable Interrupts (NMI) occurs:
asm volatile("cli");
outb(0x80, 0x70);   /* Disable NMI */
io_delay();
At first there is an inline assembly statement with the cli instruction, which clears the interrupt flag (IF). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).
An interrupt is a signal to the CPU which is emitted by hardware or software. After getting the signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now, but will discuss them in the next posts.
Let's get back to the code. We can see that the second line writes the byte 0x80 (disabled bit) to 0x70 (the CMOS Address register). After that, a call to the io_delay function occurs. io_delay causes a small delay and looks like:
static inline void io_delay(void)
{
    const u16 DELAY_PORT = 0x80;
    asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}
Outputting any byte to port 0x80 should cause a delay of about 1 microsecond. So we can write any value (the value from the AL register in our case) to the 0x80 port. After this delay the realmode_switch_hook function has finished execution and we can move to the next function.
The next function is enable_a20, which enables the A20 line. This function is defined in arch/x86/boot/a20.c and it tries to enable the A20 gate with different methods. The first is the a20_test_short function, which checks whether A20 is already enabled using the a20_test function:
static int a20_test(int loops)
{
    int ok = 0;
    int saved, ctr;

    set_fs(0x0000);
    set_gs(0xffff);

    saved = ctr = rdfs32(A20_TEST_ADDR);

    while (loops--) {
        wrfs32(++ctr, A20_TEST_ADDR);
        io_delay();     /* Serialize and make delay constant */
        ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr;
        if (ok)
            break;
    }

    wrfs32(saved, A20_TEST_ADDR);
    return ok;
}
First of all we put 0x0000 into the FS register and 0xffff into the GS register. Next we read the value at address A20_TEST_ADDR (it is 0x200) and put this value into the saved variable and ctr.
Next we write the updated ctr value to fs:A20_TEST_ADDR with the wrfs32 function, delay briefly with io_delay, and then read the value at gs:A20_TEST_ADDR+0x10. With A20 disabled, addresses wrap around at 1 MB, so this read hits the same memory that we just wrote and returns ctr. If the result of the XOR is not zero, the two values differ and the A20 line is already enabled. If A20 is disabled, we try to enable it with the different methods which you can find in a20.c, for example with a call of the 0x15 BIOS interrupt with AX=0x2401.
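Why those two addresses? A small calculation shows it. The helper below is a hypothetical model written for this explanation: it computes the real-mode address seg * 16 + off and, when A20 is disabled, forces address bit 20 to zero, which is the wraparound behaviour the test relies on.

```c
#include <stdint.h>

/* Hypothetical model of real-mode addressing with and without A20.
 * With A20 disabled, address line 20 is forced low, so addresses just
 * above 1 MB wrap around to the bottom of memory. */
static uint32_t sketch_phys_addr(uint16_t seg, uint16_t off, int a20_enabled)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;

    if (!a20_enabled)
        linear &= ~(1u << 20);

    return linear;
}
```

With fs = 0x0000 the write goes to 0x200. With gs = 0xffff the read from offset 0x210 targets 0xffff0 + 0x210 = 0x100200, which is a different location when A20 is enabled but wraps back to 0x200 when it is disabled.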
If the enable_a20 function fails, an error message is printed and the die function is called. You can remember it from the first source code file where we started, arch/x86/boot/header.S:
die:
    hlt
    jmp     die
    .size   die, .-die
After the A20 gate is successfully enabled, the reset_coprocessor function is called:
outb(0, 0xf0);
outb(0, 0xf1);
This function clears the Math Coprocessor by writing 0 to 0xf0 and then resets it by writing 0 to 0xf1.
After this the mask_all_interrupts function is called:
outb(0xff, 0xa1);   /* Mask all interrupts on the secondary PIC */
outb(0xfb, 0x21);   /* Mask all but cascade on the primary PIC */
This masks all interrupts on the secondary PIC (Programmable Interrupt Controller) and the primary PIC, except for IRQ2 on the primary PIC.
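The two magic numbers are plain bit masks: each bit of the value written to a PIC data port masks one IRQ line when set. A tiny sketch (the helper names are hypothetical) shows how 0xfb leaves only line 2, the cascade line, unmasked:

```c
#include <stdint.h>

/* Build a mask byte with every line masked except one (line must be 0..7). */
static uint8_t sketch_mask_all_except(unsigned line)
{
    return (uint8_t)~(1u << line);
}

/* A set bit means the corresponding IRQ line is masked. */
static int sketch_irq_is_masked(uint8_t mask, unsigned line)
{
    return (mask >> line) & 1;
}
```

So sketch_mask_all_except(2) is 0xfb, the value written to port 0x21, while 0xff written to port 0xa1 masks all eight lines of the secondary PIC.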
And after all of these preparations, we can see the actual transition into protected mode.
Set up Interrupt Descriptor Table
Now we set up the Interrupt Descriptor Table (IDT) with setup_idt:
static void setup_idt(void)
{
    static const struct gdt_ptr null_idt = {0, 0};
    asm volatile("lidtl %0" : : "m" (null_idt));
}
which sets up the Interrupt Descriptor Table (it describes interrupt handlers etc.). For now the IDT is not installed (we will see it later); we just load the IDT with the lidtl instruction. null_idt contains the address and size of the IDT, but for now they are just zero. null_idt is a gdt_ptr structure, which is defined as:
struct gdt_ptr {
    u16 len;
    u32 ptr;
} __attribute__((packed));
where we can see the 16-bit length (len) of the IDT and the 32-bit pointer to it (we will see more details about the IDT and interrupts in the next posts). __attribute__((packed)) means that the size of gdt_ptr is the minimum required. So the size of gdt_ptr will be 6 bytes here, or 48 bits. (Next we will load the pointer to a gdt_ptr into the GDTR register, and you might remember from the previous post that it is 48 bits in size.)
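The effect of the packed attribute can be checked at compile time. The sketch below assumes a GCC-compatible compiler; the two structure names are hypothetical, and the size of the unpacked variant depends on the platform ABI (it is commonly 8 bytes because ptr is padded to a 4-byte boundary).

```c
#include <stddef.h>
#include <stdint.h>

/* Same members, default layout: the compiler may insert padding. */
struct sketch_padded_ptr {
    uint16_t len;
    uint32_t ptr;
};

/* Same members, packed: no padding between len and ptr. */
struct sketch_packed_ptr {
    uint16_t len;
    uint32_t ptr;
} __attribute__((packed));

/* 2 bytes of length + 4 bytes of pointer = 6 bytes = 48 bits. */
_Static_assert(sizeof(struct sketch_packed_ptr) == 6,
               "packed descriptor pointer must be 6 bytes");
_Static_assert(offsetof(struct sketch_packed_ptr, ptr) == 2,
               "ptr must follow len immediately");
```

Without the attribute the CPU would read the wrong bytes as the table address, since lidt and lgdt expect the 32-bit base directly after the 16-bit limit.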
Set up Global Descriptor Table
Next is the setup of the Global Descriptor Table (GDT). We can see the setup_gdt function, which sets up the GDT (you can read about it in Kernel booting process. Part 2.). There is a definition of the boot_gdt array in this function, which contains the definition of three segments:
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
    [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
    [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
    [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
for code, data and TSS (Task State Segment). We will not use the task state segment for now; it was added there to make Intel VT happy, as we can see in the comment line (if you're interested you can find the commit which describes it here). Let's look at boot_gdt. First of all, note that it has the __attribute__((aligned(16))) attribute. It means that this structure will be aligned on 16 bytes. Let's look at a simple example:
#include <stdio.h>

struct aligned {
    int a;
} __attribute__((aligned(16)));

struct nonaligned {
    int b;
};

int main(void)
{
    struct aligned    a;
    struct nonaligned na;

    printf("Not aligned - %zu \n", sizeof(na));
    printf("Aligned - %zu \n", sizeof(a));

    return 0;
}
Technically a structure which contains one int field must be 4 bytes, but here the aligned structure will be 16 bytes:
$ gcc test.c -o test && ./test
Not aligned - 4
Aligned - 16
GDT_ENTRY_BOOT_CS has index 2 here, GDT_ENTRY_BOOT_DS is GDT_ENTRY_BOOT_CS + 1, etc. It starts from 2 because the first is a mandatory null descriptor (index 0) and the second is not used (index 1).
GDT_ENTRY is a macro which takes flags, base and limit and builds a GDT entry. For example, let's look at the code segment entry. GDT_ENTRY takes the following values:
* base - 0
* limit - 0xfffff
* flags - 0xc09b
What does it mean? The segment's base address is 0 and the limit (size of the segment) is 0xfffff. Because the granularity bit is set, the limit is counted in 4 KB units, so the segment covers 4 GB. Let's look at the flags. It is 0xc09b and it will be:
1100 0000 1001 1011
in binary. Let's try to understand what every bit means. We will go through all the bits from left to right:
* 1 - (G) granularity bit
* 1 - (D) if 0, 16-bit segment; if 1, 32-bit segment
* 0 - (L) executed in 64-bit mode if 1
* 0 - (AVL) available for use by system software
* 0000 - 4 bits in the position of limit bits 19:16 in the descriptor
* 1 - (P) segment presence in memory
* 00 - (DPL) privilege level, 0 is the highest privilege
* 1 - (S) code or data segment, not a system segment
* 101 - segment type execute/read
* 1 - accessed bit
You can read more about every bit in the previous post or in the Intel® 64 and IA-32 Architectures Software Developer's Manuals 3A.
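To see how flags, base and limit end up in one 64-bit descriptor, here is a sketch of the packing, written from the descriptor layout in the Intel manual. sketch_gdt_entry and the two decoding helpers are hypothetical names; this illustrates the bit layout and should not be read as a copy of the kernel's GDT_ENTRY macro.

```c
#include <stdint.h>

/* Pack a segment descriptor:
 *   bits 15:0   limit 15:0      bits 39:16  base 23:0
 *   bits 47:40  access byte     bits 51:48  limit 19:16
 *   bits 55:52  G, D, L, AVL    bits 63:56  base 31:24
 * flags carries the access byte in bits 7:0 and G/D/L/AVL in bits 15:12. */
static uint64_t sketch_gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    return ((uint64_t)(base  & 0xff000000u) << 32) |
           ((uint64_t)(flags & 0x0000f0ffu) << 40) |
           ((uint64_t)(limit & 0x000f0000u) << 32) |
           ((uint64_t)(base  & 0x00ffffffu) << 16) |
           ((uint64_t)(limit & 0x0000ffffu));
}

/* Recover the 20-bit limit from a packed descriptor. */
static uint32_t sketch_gdt_limit(uint64_t entry)
{
    return (uint32_t)((entry & 0xffff) | ((entry >> 32) & 0xf0000));
}

/* Granularity bit: 1 means the limit is counted in 4 KB units. */
static int sketch_gdt_granular(uint64_t entry)
{
    return (int)((entry >> 55) & 1);
}
```

For the code segment above, flags 0xc09b, base 0 and limit 0xfffff pack into 0x00cf9b000000ffff, the familiar flat 4 GB code descriptor.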
After this we get the length of the GDT with:
gdt.len = sizeof(boot_gdt)-1;
We get the size of boot_gdt and subtract 1 (the last valid address in the GDT).
Next we get a pointer to the GDT with:
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
Here we just get the address of boot_gdt and add it to the address of the data segment left-shifted by 4 bits (remember we're in real mode now).
Lastly we execute the lgdtl instruction to load the GDT into the GDTR register:
asm volatile("lgdtl %0" : : "m" (gdt));
Actual transition into protected mode
This is the end of the go_to_protected_mode function. We loaded the IDT and GDT, disabled interrupts, and now can switch the CPU into protected mode. As the last step we call the protected_mode_jump function with two parameters:
protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));
which is defined in arch/x86/boot/pmjump.S. It takes two parameters:
* address of the protected mode entry point
* address of boot_params
Let's look inside protected_mode_jump. As I wrote above, you can find it in arch/x86/boot/pmjump.S. The first parameter will be in the eax register and the second in edx.
First of all we put the address of boot_params in the esi register and the value of the code segment register cs (0x1000) in bx. After this we shift bx left by 4 bits and add the result to the value stored at label 2 (so that the in_pm32 offset stored there becomes a physical address) and jump to label 1. Next we put the data segment and task state segment selectors in the cx and di registers with:
movw    $__BOOT_DS, %cx
movw    $__BOOT_TSS, %di
As you can read above, GDT_ENTRY_BOOT_CS has index 2 and every GDT entry is 8 bytes, so CS will be 2 * 8 = 16, __BOOT_DS is 24, etc.
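The index-to-selector arithmetic can be written down directly. In this sketch the enum names are hypothetical stand-ins for the GDT_ENTRY_BOOT_* constants, and the helpers assume a GDT selector with requested privilege level 0, so the three low bits of the selector are zero.

```c
#include <stdint.h>

enum {
    SKETCH_BOOT_CS_INDEX  = 2,
    SKETCH_BOOT_DS_INDEX  = 3,
    SKETCH_BOOT_TSS_INDEX = 4,
};

/* Selector = index * 8: the index lives in bits 15:3, while the
 * table indicator (bit 2) and the RPL (bits 1:0) are left at zero. */
static uint16_t sketch_selector(unsigned index)
{
    return (uint16_t)(index << 3);
}

/* Reverse direction: which GDT slot does a selector refer to? */
static unsigned sketch_selector_index(uint16_t selector)
{
    return selector >> 3;
}
```

This gives 0x10 (16) for the code segment, 0x18 (24) for the data segment and 0x20 (32) for the TSS.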
Next we set the PE (Protection Enable) bit in the CR0 control register:
movl    %cr0, %edx
orb     $X86_CR0_PE, %dl
movl    %edx, %cr0
and make a long jump to protected mode:
    .byte   0x66, 0xea
2:  .long   in_pm32
    .word   __BOOT_CS
where
* 0x66 is the operand-size prefix which allows mixing 16-bit and 32-bit code,
* 0xea is the jump opcode,
* in_pm32 is the segment offset
* __BOOT_CS is the code segment.
After this we are finally in protected mode:
.code32
.section ".text32","ax"
Let's look at the first steps in protected mode. First of all we set up the data segment with:
movl    %ecx, %ds
movl    %ecx, %es
movl    %ecx, %fs
movl    %ecx, %gs
movl    %ecx, %ss
If you read attentively, you will remember that we saved $__BOOT_DS in the cx register. Now we fill all segment registers besides cs with it (cs is already __BOOT_CS). Next we zero out all general purpose registers besides eax with:
xorl    %ecx, %ecx
xorl    %edx, %edx
xorl    %ebx, %ebx
xorl    %ebp, %ebp
xorl    %edi, %edi
And jump to the 32-bit entry point in the end:
jmpl    *%eax
Remember that eax contains the address of the 32-bit entry (we passed it as the first parameter into protected_mode_jump).
That's all. We're in protected mode and have stopped at its entry point. What happens next, we will see in the next part.
Conclusion
This is the end of the third part about linux kernel internals. In the next part we will see the first steps in protected mode and the transition into long mode.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR with corrections at linux-internals.
Links
* VGA
* VESA BIOS Extensions
* Data structure alignment
* Non-maskable interrupt
* A20
* GCC designated inits
* GCC type attributes
* Previous part
Kernel booting process. Part 4.
Transition to 64-bit mode
This is the fourth part of the Kernel booting process, where we will see the first steps in protected mode, like checking that the CPU supports long mode and SSE, paging, initialization of the page tables and, at the end of this part, the transition to long mode.
NOTE: there will be a lot of assembly code in this part, so if you are not familiar with it, you might want to consult a book about it
32-bit entry point
In the previous part we stopped at the jump to the 32-bit entry point in arch/x86/boot/pmjump.S:
jmpl    *%eax
Recall that the eax register contains the address of the 32-bit entry point. We can read about this point in the linux kernel x86 boot protocol:
When using bzImage, the protected-mode kernel was relocated to 0x100000
And now we can make sure that it is true. Let's look at the register values at the 32-bit entry point:
eax            0x100000 1048576
ecx            0x0      0
edx            0x0      0
ebx            0x0      0
esp            0x1ff5c  0x1ff5c
ebp            0x0      0x0
esi            0x14470  83056
edi            0x0      0
eip            0x100000 0x100000
eflags         0x46     [ PF ZF ]
cs             0x10     16
ss             0x18     24
ds             0x18     24
es             0x18     24
fs             0x18     24
gs             0x18     24
We can see here that the cs register contains 0x10 (as you will remember from the previous part, it is the second index in the Global Descriptor Table), the eip register is 0x100000 and the base address of all segments, including the code segment, is zero. So we can get the physical address: it will be 0:0x100000 or just 0x100000, as in the boot protocol. Now let's start with the 32-bit entry point.
We can find the definition of the 32-bit entry point in arch/x86/boot/compressed/head_64.S:
    __HEAD
    .code32
ENTRY(startup_32)
....
....
....
ENDPROC(startup_32)
First of all, why the compressed directory? Actually bzimage is a gzipped vmlinux + header + kernel setup code. We saw the kernel setup code in all of the previous parts. So, the main goal of head_64.S is to prepare for entering long mode, enter it, and decompress the kernel. We will see all of these steps besides kernel decompression in this part.
Also you can note that there are two files in the arch/x86/boot/compressed directory:
* head_32.S
* head_64.S
We will see only head_64.S because we are learning the linux kernel for x86_64. head_32.S is not even compiled in our case. Let's look at arch/x86/boot/compressed/Makefile, where we can see the following target:
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
    $(obj)/string.o $(obj)/cmdline.o \
    $(obj)/piggy.o $(obj)/cpuflags.o
Note $(obj)/head_$(BITS).o. It means that the compilation of head_{32,64}.o depends on the value of $(BITS). We can find it in another Makefile, arch/x86/Makefile:
ifeq ($(CONFIG_X86_32),y)
    BITS := 32
    ...
    ...
else
    ...
    ...
    BITS := 64
endif
Now we know where to start, so let's do it.
Reload the segments if needed
As I wrote above, we start in arch/x86/boot/compressed/head_64.S. First of all we can see before the startup_32 definition:
    __HEAD
    .code32
ENTRY(startup_32)
__HEAD is defined in include/linux/init.h and looks like:
#define __HEAD .section ".head.text","ax"
We can find this section in the arch/x86/boot/compressed/vmlinux.lds.S linker script:
SECTIONS
{
    . = 0;
    .head.text : {
        _head = . ;
        HEAD_TEXT
        _ehead = . ;
    }
Note the . = 0; line. The . symbol is a special variable of the linker: the location counter. A value assigned to it is an offset relative to the start of the segment. As we assign zero to it, we can read from the comments:
Be careful parts of head_64.S assume startup_32 is at address 0.
Ok, now we know where we are, and now is the best time to look inside the startup_32 function.
At the start of startup_32 we can see the cld instruction, which clears the DF flag. After this, string operations like stosb and others will increment the index registers esi or edi.
Next we can see the check of the KEEP_SEGMENTS flag from loadflags. If you remember, we already saw loadflags in arch/x86/boot/header.S (there we checked the CAN_USE_HEAP flag). Now we need to check the KEEP_SEGMENTS flag. We can find the description of this flag in the linux boot protocol:
Bit 6 (write): KEEP_SEGMENTS
  Protocol: 2.07+
  - If 0, reload the segment registers in the 32bit entry point.
  - If 1, do not reload the segment registers in the 32bit entry point.
    Assume that %cs %ds %ss %es are all set to flat segments with
    a base of 0 (or the equivalent for their environment).
and if KEEP_SEGMENTS is not set, we need to set the ds, ss and es registers to a flat segment with base 0. We do that with:
    testb   $(1<<6), BP_loadflags(%esi)
    jnz     1f

    cli
    movl    $(__BOOT_DS), %eax
    movl    %eax, %ds
    movl    %eax, %es
    movl    %eax, %ss
Remember that __BOOT_DS is 0x18 (the selector of the data segment in the Global Descriptor Table). If KEEP_SEGMENTS is set, we jump to the label 1f and skip the reload; if this flag is not set, we update the segment registers with __BOOT_DS.
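The decision made by testb and jnz boils down to a single bit test. The sketch below expresses it in C; the macro and function names are hypothetical, while the bit positions (bit 6 for KEEP_SEGMENTS, bit 7 for CAN_USE_HEAP) are taken from the boot protocol.

```c
#include <stdint.h>

#define SKETCH_KEEP_SEGMENTS (1u << 6)  /* loadflags bit 6 */
#define SKETCH_CAN_USE_HEAP  (1u << 7)  /* loadflags bit 7 */

/* Returns 1 when the 32-bit entry point has to reload ds, es and ss,
 * i.e. when the bootloader did not set KEEP_SEGMENTS. */
static int sketch_must_reload_segments(uint8_t loadflags)
{
    return (loadflags & SKETCH_KEEP_SEGMENTS) == 0;
}
```

A bootloader that passes loadflags = 0x40 therefore promises that the segment registers are already flat, and the reload is skipped.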
Ifyoureadpreviousthepart,youcanrememberthatwealreadyupdatedsegmentregistersinthearch/x86/boot/pmjump.S,sowhyweneedtosetupitagain?Actuallylinuxkernelhasalso32-bitbootprotocol,sostartup_32canbefirstfunctionwhichwillbeexecutedrightafterabootloadertransferscontroltothekernel.
Now that we have checked the KEEP_SEGMENTS flag and put the correct value into the segment registers, the next step is to calculate the difference between where we were loaded and where we were compiled to run (remember that the linker script shown above contains . = 0 at the start of the section):
	leal (BP_scratch+4)(%esi), %esp
	call 1f
1:	popl %ebp
	subl $1b, %ebp
Here the esi register contains the address of the boot_params structure. boot_params contains a special field, scratch, at offset 0x1e4. We take the address of the scratch field + 4 bytes and put it into the esp register (we will use it as a tiny stack for these calculations). After this we can see the call instruction with the 1f label as its operand. What does call do? It pushes the return address, which is the address of the instruction right after the call (here, label 1), onto the stack and jumps to the target. After this we pop the return address from the stack into the ebp register (ebp now contains the run-time address of label 1) and subtract the link-time address of that same label.
After this, ebp holds the address where we were loaded, namely 0x100000.
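The call/pop trick above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not kernel code: load_delta, LABEL_OFFSET and the example load address are hypothetical names and values chosen for illustration.

```python
def load_delta(load_base: int, link_addr_of_label: int) -> int:
    """Model `call 1f; 1: popl %ebp; subl $1b, %ebp` (illustrative only)."""
    # `call` pushes the run-time address of the next instruction (label 1).
    runtime_addr_of_label = load_base + link_addr_of_label
    stack = [runtime_addr_of_label]
    # `popl %ebp` takes that return address off the stack.
    ebp = stack.pop()
    # `subl $1b, %ebp` subtracts the link-time address of the same label,
    # leaving the address the image was actually loaded at (32-bit wrap).
    return (ebp - link_addr_of_label) & 0xFFFFFFFF


# Hypothetical link-time offset of label 1 inside the image.
LABEL_OFFSET = 0x3A
delta = load_delta(0x100000, LABEL_OFFSET)
```

Whatever the offset of the label is, it cancels out, so delta equals the load address (0x100000 in this example).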

Stack setup and CPU verification

Now we can set up the stack and verify that the CPU supports long mode and SSE.
First we can see the assembly code which sets up a new stack for kernel decompression:
	movl $boot_stack_end, %eax
	addl %ebp, %eax
	movl %eax, %esp
boot_stack_end is in the .bss section; we can see its definition at the end of head_64.S:
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
First of all we put the address of boot_stack_end into the eax register and add the value of ebp to it (remember that ebp now contains the address where we were loaded, 0x100000). In the end we just put the eax value into esp and that's all, we have a correct stack pointer.
The next step is CPU verification. We need to check that the CPU supports long mode and SSE:
	call verify_cpu
	testl %eax, %eax
	jnz no_longmode
It just calls the verify_cpu function from arch/x86/kernel/verify_cpu.S, which contains a couple of calls to the cpuid instruction. cpuid is the instruction used to get information about the processor. In our case it checks for long mode and SSE support and returns 0 on success or 1 on failure in the eax register.
If eax is not zero, we jump to the no_longmode label, which just halts the CPU with the hlt instruction; if an interrupt wakes it up, the jmp sends it straight back to hlt.
no_longmode:
1:
	hlt
	jmp 1b
We have set up the stack and checked the CPU, and now we can move on to the next step.

Calculate relocation address

The next step is calculating the relocation address for decompression, if needed. We can see the following assembly code:
#ifdef CONFIG_RELOCATABLE
	movl %ebp, %ebx
	movl BP_kernel_alignment(%esi), %eax
	decl %eax
	addl %eax, %ebx
	notl %eax
	andl %eax, %ebx
	cmpl $LOAD_PHYSICAL_ADDR, %ebx
	jge 1f
#endif
	movl $LOAD_PHYSICAL_ADDR, %ebx
1:
	addl $z_extract_offset, %ebx
First of all, note the CONFIG_RELOCATABLE macro. This configuration option is defined in arch/x86/Kconfig and, as we can read from its description:
This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.
Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.
In short, this code calculates the address where the kernel will be moved for decompression and puts it into the ebx register. If the kernel is relocatable, the address is derived from where we were loaded; otherwise the bzImage will decompress itself above LOAD_PHYSICAL_ADDR.
Let's look at the code. If we have CONFIG_RELOCATABLE=n in our kernel configuration file, it just puts LOAD_PHYSICAL_ADDR into the ebx register and adds z_extract_offset to ebx, so ebx will contain LOAD_PHYSICAL_ADDR + z_extract_offset. Now let's try to understand these two values.
LOAD_PHYSICAL_ADDR is a macro defined in arch/x86/include/asm/boot.h and it looks like this:
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
Here we calculate the aligned address where the kernel is loaded (0x100000 or 1 megabyte in our case). PHYSICAL_ALIGN is the alignment value to which the kernel should be aligned; it ranges from 0x200000 to 0x1000000 for x86_64. With these values we will get 2 megabytes in LOAD_PHYSICAL_ADDR:
>>> 0x100000 + (0x200000 - 1) & ~(0x200000 - 1)
2097152
After we get the aligned address, we add z_extract_offset (which is 0xe5c000 in my case) to the 2 megabytes. In the end we will get a 17154048 byte offset. You can find z_extract_offset in arch/x86/boot/compressed/piggy.S. This file is generated at compile time by the mkpiggy program.
Now let's try to understand the code if CONFIG_RELOCATABLE is y.
First of all we put the ebp value into ebx (remember that ebp contains the address where we were loaded) and the kernel_alignment field from the kernel setup header into the eax register. kernel_alignment is the physical address alignment required for the kernel. Next we do the same as in the previous case (when the kernel is not relocatable), but we use the value of the kernel_alignment field as the alignment unit and ebx (the address where we were loaded) as the base address, instead of CONFIG_PHYSICAL_ALIGN and LOAD_PHYSICAL_ADDR.
After we have calculated the address, we compare it with LOAD_PHYSICAL_ADDR. If the calculated address is less than that, we put LOAD_PHYSICAL_ADDR into ebx instead; in either case we then add z_extract_offset to it.
After all of these calculations we will have ebp, which contains the address where we were loaded, and ebx, with the address where the kernel will be moved for decompression.
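The whole calculation can be sketched in Python to check the numbers quoted above. The function names align_up and relocation_address are made up for this illustration, and the sketch ignores the fact that jge in the assembly is a signed 32-bit comparison.

```python
def align_up(value: int, align: int) -> int:
    """Round value up to a multiple of align (align must be a power of two)."""
    return (value + align - 1) & ~(align - 1)


def relocation_address(load_addr, kernel_alignment, load_physical_addr,
                       z_extract_offset, relocatable=True):
    """Mirror the ebx calculation from startup_32 (illustrative sketch)."""
    if relocatable:
        # decl/addl/notl/andl: align the load address up to kernel_alignment.
        ebx = align_up(load_addr, kernel_alignment)
        # cmpl/jge: never go below LOAD_PHYSICAL_ADDR.
        if ebx < load_physical_addr:
            ebx = load_physical_addr
    else:
        ebx = load_physical_addr
    # addl $z_extract_offset, %ebx
    return ebx + z_extract_offset


# Values quoted in the text: 1 MB start, 2 MB alignment, 0xe5c000 offset.
LOAD_PHYSICAL_ADDR = align_up(0x100000, 0x200000)
target = relocation_address(0x100000, 0x200000, LOAD_PHYSICAL_ADDR, 0xe5c000)
```

With these inputs LOAD_PHYSICAL_ADDR is 2097152 and target is 17154048, the same values as in the text.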

Preparation before entering long mode

Now we need to do the last preparations before we can see the transition to 64-bit mode. First we need to update the Global Descriptor Table:
	leal gdt(%ebp), %eax
	movl %eax, gdt+2(%ebp)
	lgdt gdt(%ebp)
Here we compute the run-time address of gdt (ebp plus the gdt offset) in the eax register, store this address in memory at offset gdt+2 from ebp (the base field of the GDT pointer) and load the Global Descriptor Table with the lgdt instruction.
Let's look at the Global Descriptor Table definition:
	.data
gdt:
	.word gdt_end - gdt
	.long gdt
	.word 0
	.quad 0x0000000000000000 /* NULL descriptor */
	.quad 0x00af9a000000ffff /* __KERNEL_CS */
	.quad 0x00cf92000000ffff /* __KERNEL_DS */
	.quad 0x0080890000000000 /* TS descriptor */
	.quad 0x0000000000000000 /* TS continued */
It is defined in the same file, in the .data section. It contains 5 entries: the null descriptor, the kernel code segment, the kernel data segment and a task state descriptor which occupies two entries. We already loaded a GDT in the previous part and we're doing almost the same here, but now the code descriptor has CS.L = 1 and CS.D = 0 for execution in 64-bit mode.
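To see where the CS.L and CS.D bits sit in these 64-bit values, a descriptor can be decoded by hand. The following Python sketch follows the standard x86 segment descriptor layout; decode_descriptor is a hypothetical helper written for this illustration, not something from the kernel.

```python
def decode_descriptor(raw: int) -> dict:
    """Split a 64-bit x86 segment descriptor into its fields."""
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)
    access = (raw >> 40) & 0xFF      # type, S, DPL and P bits
    flags = (raw >> 52) & 0xF        # AVL, L, D/B, G
    return {
        "base": base,
        "limit": limit,
        "access": access,
        "L": (flags >> 1) & 1,       # 64-bit code segment
        "D": (flags >> 2) & 1,       # default operand size
        "G": (flags >> 3) & 1,       # limit granularity (4 KB units)
    }


kernel_cs = decode_descriptor(0x00af9a000000ffff)
kernel_ds = decode_descriptor(0x00cf92000000ffff)
```

For the __KERNEL_CS value this gives L = 1 and D = 0, while the __KERNEL_DS value has L = 0 and D = 1; both have base 0 and the maximum limit.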
After we have loaded the Global Descriptor Table, we must enable PAE mode by putting the value of the cr4 register into eax, setting bit 5 in it and loading it back into cr4:
	movl %cr4, %eax
	orl $X86_CR4_PAE, %eax
	movl %eax, %cr4
Now we are almost finished with all the preparations before we can move into 64-bit mode. The last step is to build page tables, but first some information about long mode.

Long mode

Long mode is the native mode for x86_64 processors. First of all let's look at some differences between x86_64 and x86.
It provides features such as:
* 8 new general purpose registers, from r8 to r15, and all general purpose registers are 64-bit now
* 64-bit instruction pointer - RIP
* New operating mode - Long mode
* 64-bit addresses and operands
* RIP relative addressing (we will see an example of it in the next parts)
Long mode is an extension of legacy protected mode. It consists of two sub-modes:
* 64-bit mode
* compatibility mode
To switch into 64-bit mode we need to do the following things:
* enable PAE (we already did it, see above)
* build page tables and load the address of the top level page table into the cr3 register
* enable EFER.LME
* enable paging
We already enabled PAE by setting the PAE bit in the cr4 register. Now let's look at paging.

Early page tables initialization

Before we can move into 64-bit mode, we need to build page tables, so let's look at the building of the early 4G boot page tables.
NOTE: I will not describe the theory of virtual memory here; if you need to know more about it, see the links at the end.
The Linux kernel uses 4-level paging, and here we build 6 page tables:
* One PML4 table
* One PDP table
* Four Page Directory tables
Let's look at the implementation. First of all we clear the buffer for the page tables in memory. Every table is 4096 bytes, so we need a 24 kilobyte buffer:
	leal pgtable(%ebx), %edi
	xorl %eax, %eax
	movl $((4096*6)/4), %ecx
	rep stosl
We put the address stored in ebx (remember that ebx contains the address where the kernel will be relocated for decompression) plus the pgtable offset into the edi register. pgtable is defined at the end of head_64.S and looks like this:
	.section ".pgtable","a",@nobits
	.balign 4096
pgtable:
	.fill 6*4096, 1, 0
It is in the .pgtable section and its size is 24 kilobytes. After we put the address into edi, we zero out the eax register and write zeros to the buffer with the rep stosl instruction.
Now we can build the top level page table - PML4 - with:
	leal pgtable + 0(%ebx), %edi
	leal 0x1007(%edi), %eax
	movl %eax, 0(%edi)
Here we take the address stored in ebx plus the pgtable offset and put it into edi. Next we put this address plus 0x1007 into the eax register. 0x1007 is 4096 bytes (the size of the PML4) + 7 (the PML4 entry flags - PRESENT+RW+USER), so eax holds the address of the Page Directory Pointer table together with the flags. Then we write eax to the memory that edi points to. After these manipulations the first PML4 entry contains the address of the Page Directory Pointer table with the flags PRESENT+RW+USER.
In the next step we build 4 Page Directory entries in the Page Directory Pointer table, all of them with the 0x7 flags:
	leal pgtable + 0x1000(%ebx), %edi
	leal 0x1007(%edi), %eax
	movl $4, %ecx
1:	movl %eax, 0x00(%edi)
	addl $0x00001000, %eax
	addl $8, %edi
	decl %ecx
	jnz 1b
We put the base address of the page directory pointer table into edi and the address of the first page directory, with the 0x7 flags, into eax. We put 4 into the ecx register, which will be the counter of the following loop, and write eax to the first page directory pointer table entry, which edi points to.
On each following iteration we add 0x1000 to eax, so that it refers to the next page directory with the same 0x7 flags, and add 8 (the size of one entry) to edi, so that the value is written to the next page directory pointer entry.
The next step is building 2048 page directory entries, each mapping 2 megabytes:
	leal pgtable + 0x2000(%ebx), %edi
	movl $0x00000183, %eax
	movl $2048, %ecx
1:	movl %eax, 0(%edi)
	addl $0x00200000, %eax
	addl $8, %edi
	decl %ecx
	jnz 1b
Here we do almost the same as in the previous example, except that every entry has the flags $0x00000183 (PRESENT + WRITE + page size + GLOBAL), the physical address in eax grows by 2 megabytes on each iteration and edi again advances by 8. In the end we will have 2048 pages of 2 megabytes each.
Our early page table structure is done; it maps 4 gigabytes of memory, and now we can put the address of the top level page table - PML4 - into the cr3 control register:
	leal pgtable(%ebx), %eax
	movl %eax, %cr3
That's all; now we can see the transition to long mode.
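To convince ourselves that these six tables really give a 1:1 mapping of the first 4 gigabytes, the three loops can be replayed in Python and an address can be walked through the result by hand. This is only a model under stated assumptions: build_early_tables, translate and the example base address are hypothetical, and the table memory is a plain bytearray instead of physical memory.

```python
import struct

PAGE = 4096
ADDR_MASK = 0x000FFFFFFFFFF000  # bits 12..51 of a table entry


def build_early_tables(pgtable_base: int) -> bytearray:
    """Fill 6 pages the way the assembly above fills pgtable."""
    mem = bytearray(6 * PAGE)  # zeroed, like `rep stosl`

    def write_entry(phys_addr, value):
        struct.pack_into("<Q", mem, phys_addr - pgtable_base, value)

    # Level 4: one entry pointing at the PDP table (next page), flags 0x7.
    write_entry(pgtable_base, pgtable_base + 0x1007)
    # Level 3: four entries, one per page directory, flags 0x7.
    for i in range(4):
        write_entry(pgtable_base + 0x1000 + 8 * i,
                    pgtable_base + 0x2007 + 0x1000 * i)
    # Level 2: 2048 entries, each mapping a 2 MB page, flags 0x183.
    for i in range(2048):
        write_entry(pgtable_base + 0x2000 + 8 * i, 0x183 + 0x200000 * i)
    return mem


def translate(mem: bytearray, pgtable_base: int, vaddr: int) -> int:
    """Walk PML4 -> PDP -> PD for a 2 MB page and return the physical address."""
    def read_entry(table_phys, index):
        return struct.unpack_from("<Q", mem, table_phys - pgtable_base + 8 * index)[0]

    pml4e = read_entry(pgtable_base, (vaddr >> 39) & 0x1FF)
    pdpte = read_entry(pml4e & ADDR_MASK, (vaddr >> 30) & 0x1FF)
    pde = read_entry(pdpte & ADDR_MASK, (vaddr >> 21) & 0x1FF)
    if not pde & 0x80:
        raise ValueError("expected a 2 MB page (PS bit set)")
    return (pde & ADDR_MASK & ~0x1FFFFF) + (vaddr & 0x1FFFFF)
```

With any page-aligned base, translate returns the same value it was given for every address below 4 gigabytes, which is exactly what an identity mapping means.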

Transition to the long mode

First of all we need to set the EFER.LME flag in the MSR at 0xC0000080:
	movl $MSR_EFER, %ecx
	rdmsr
	btsl $_EFER_LME, %eax
	wrmsr
Here we put the MSR_EFER number (defined in arch/x86/include/uapi/asm/msr-index.h) into the ecx register and execute the rdmsr instruction, which reads the MSR selected by ecx. After rdmsr has executed, we have the resulting data in edx:eax. We set the EFER_LME bit with the btsl instruction and write the data from edx:eax back to the MSR with the wrmsr instruction.
In the next step we push the kernel code segment selector onto the stack (we defined the segment in the GDT) and put the address of the startup_64 routine into eax:
	pushl $__KERNEL_CS
	leal startup_64(%ebp), %eax
After this we push this address onto the stack and enable paging by setting the PG and PE bits in the cr0 register:
	pushl %eax
	movl $(X86_CR0_PG | X86_CR0_PE), %eax
	movl %eax, %cr0
and execute:
	lret
Remember that we pushed the address of the startup_64 function onto the stack in the previous step; the lret instruction pops this address together with the code segment selector, and the CPU jumps there.
After all of these steps we're finally in 64-bit mode:
	.code64
	.org 0x200
ENTRY(startup_64)
....
....
....
That's all!

Conclusion

This is the end of the fourth part about the Linux kernel booting process. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.
In the next part we will see kernel decompression and much more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

* Protected mode
* Intel® 64 and IA-32 Architectures Software Developer's Manual 3A
* GNU linker
* SSE
* Paging
* Model specific register
* .fill instruction
* Previous part
* Paging on osdev.org
* Paging Systems
* x86 Paging Tutorial

Kernel booting process. Part 5.

Kernel decompression

This is the fifth part of the Kernel booting process series. We saw the transition to 64-bit mode in the previous part and we will continue from this point in this part. We will see the last steps before we jump to the kernel code: preparation for kernel decompression, relocation and the kernel decompression itself. So... let's start to dive into the kernel code again.

Preparation before kernel decompression

We stopped right before the jump to the 64-bit entry point - startup_64, which is located in the arch/x86/boot/compressed/head_64.S source code file. We already saw the jump to startup_64 in startup_32:
	pushl $__KERNEL_CS
	leal startup_64(%ebp), %eax
	...
	...
	...
	pushl %eax
	...
	...
	...
	lret
in the previous part. Now startup_64 starts to work. Since we loaded the new Global Descriptor Table and there was a CPU transition into another mode (64-bit mode in our case), we can see the setup of the data segments:
	.code64
	.org 0x200
ENTRY(startup_64)
	xorl %eax, %eax
	movl %eax, %ds
	movl %eax, %es
	movl %eax, %ss
	movl %eax, %fs
	movl %eax, %gs
at the beginning of startup_64. All segment registers besides cs are now loaded with zero, the null selector.
The next step is the computation of the difference between where the kernel was compiled to run and where it was loaded:
#ifdef CONFIG_RELOCATABLE
	leaq startup_32(%rip), %rbp
	movl BP_kernel_alignment(%rsi), %eax
	decl %eax
	addq %rax, %rbp
	notq %rax
	andq %rax, %rbp
	cmpq $LOAD_PHYSICAL_ADDR, %rbp
	jge 1f
#endif
	movq $LOAD_PHYSICAL_ADDR, %rbp
1:
	leaq z_extract_offset(%rbp), %rbx
rbp contains the start address of the decompressed kernel, and after this code has executed the rbx register will contain the address where the kernel code will be relocated for decompression. We already saw code like this in startup_32 (you can read about it in the previous part - Calculate relocation address), but we need to do this calculation again because the bootloader can use the 64-bit boot protocol, in which case startup_32 is simply not executed.
In the next step we can see the setup of the stack and the reset of the flags register:
	leaq boot_stack_end(%rbx), %rsp
	pushq $0
	popfq
As you can see above, the rbx register contains the start address of the kernel decompression code and we just put this address plus the boot_stack_end offset into the rsp register. After this the stack will be correct. You can find the definition of boot_stack_end at the end of the compressed/head_64.S file:
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
It is located in the .bss section, right before .pgtable. You can look at arch/x86/boot/compressed/vmlinux.lds.S to find it.
Now that we have set up the stack, we can copy the compressed kernel to the address that we got above, when we calculated the relocation address of the decompressed kernel. Let's look at this code:
	pushq %rsi
	leaq (_bss-8)(%rip), %rsi
	leaq (_bss-8)(%rbx), %rdi
	movq $_bss, %rcx
	shrq $3, %rcx
	std
	rep movsq
	cld
	popq %rsi
First of all we push rsi onto the stack. We need to save the value of rsi because this register now stores the pointer to the boot_params real mode structure (you must remember this structure; we filled it at the start of the kernel setup). At the end of this code we'll restore the pointer to boot_params into rsi again.
The next two leaq instructions calculate effective addresses from rip and rbx with the _bss - 8 offset and put them into rsi and rdi. Why do we calculate these addresses? The compressed kernel image is located between this copying code (from startup_32 to the current code) and the decompression code. You can verify this by looking at the linker script - arch/x86/boot/compressed/vmlinux.lds.S:
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
	.rodata..compressed : {
		*(.rodata..compressed)
	}
	.text : {
		_text = .; /* Text */
		*(.text)
		*(.text.*)
		_etext = . ;
	}
Note that the .head.text section contains startup_32. You may remember it from the previous part:
	__HEAD
	.code32
ENTRY(startup_32)
...
...
...
The .text section contains the decompression code:
	.text
relocated:
...
...
...
/*
 * Do the decompression, and jump to the new kernel..
 */
...
And .rodata..compressed contains the compressed kernel image.
So rsi will contain the rip-relative address of _bss - 8 and rdi will contain the relocation-relative address of _bss - 8. Having stored these addresses in registers, we put the address of _bss into the rcx register and shift it right by 3 to get the number of 8-byte words to copy. As you can see in vmlinux.lds.S, _bss is located at the end of all sections with the setup/kernel code. Now we can start to copy data from rsi to rdi, 8 bytes at a time, with the movsq instruction.
Note that there is an std instruction before the data copying. It sets the DF flag, which means that rsi and rdi will be decremented; in other words, we will copy the bytes backwards.
In the end we clear the DF flag with the cld instruction and restore the pointer to the boot_params structure into rsi.
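The text does not say why the copy runs backwards. A likely reason is that the source and destination ranges can overlap, with the destination at the higher address, and then a forward copy would overwrite source bytes before they are read. The following Python sketch shows the effect on a small buffer; copy_forward and copy_backward are illustrative helpers, not kernel functions.

```python
def copy_forward(mem: bytearray, src: int, dst: int, n: int) -> None:
    """Copy n bytes from src to dst, lowest address first (DF = 0)."""
    for i in range(n):
        mem[dst + i] = mem[src + i]


def copy_backward(mem: bytearray, src: int, dst: int, n: int) -> None:
    """Copy n bytes from src to dst, highest address first (DF = 1)."""
    for i in reversed(range(n)):
        mem[dst + i] = mem[src + i]


# 8 bytes of payload at offset 0, moved 4 bytes up: the ranges overlap.
good = bytearray(range(8)) + bytearray(4)
copy_backward(good, 0, 4, 8)

bad = bytearray(range(8)) + bytearray(4)
copy_forward(bad, 0, 4, 8)
```

After the backward copy, good[4:12] holds the original eight bytes, while the forward copy leaves bad[4:12] corrupted because the first four bytes were copied over the second half of the source.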
After this we get the address of the .text section and jump to it:
	leaq relocated(%rbx), %rax
	jmp *%rax

Last preparation before kernel decompression

The .text section starts with the relocated label. At the start there is clearing of the bss section:
	xorl %eax, %eax
	leaq _bss(%rip), %rdi
	leaq _ebss(%rip), %rcx
	subq %rdi, %rcx
	shrq $3, %rcx
	rep stosq
Here we just clear eax, put the RIP-relative address of _bss into rdi and that of _ebss into rcx, compute the length of the section in 8-byte words and fill it with zeros with the rep stosq instruction.
In the end we can see the call of the decompress_kernel routine:
	pushq %rsi
	movq $z_run_size, %r9
	pushq %r9
	movq %rsi, %rdi
	leaq boot_heap(%rip), %rsi
	leaq input_data(%rip), %rdx
	movl $z_input_len, %ecx
	movq %rbp, %r8
	movq $z_output_len, %r9
	call decompress_kernel
	popq %r9
	popq %rsi
Again we save rsi with the pointer to the boot_params structure and call decompress_kernel from arch/x86/boot/compressed/misc.c with seven arguments. The first six arguments are passed in registers and the seventh, z_run_size, is pushed onto the stack. We have finished all the preparation and now can look at the kernel decompression.

Kernel decompression

As I wrote above, the decompress_kernel function is in the arch/x86/boot/compressed/misc.c source code file. This function starts with the video/console initialization that we saw in the previous parts. These calls are needed in case the bootloader used the 32 or 64-bit protocols. After this we store pointers to the start of the free memory and to the end of it:
	free_mem_ptr     = heap;
	free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
where heap is the second parameter of the decompress_kernel function, which we got with:
	leaq boot_heap(%rip), %rsi
As you saw above, boot_heap is defined as:
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
where BOOT_HEAP_SIZE is 0x400000 if the kernel is compressed with bzip2 or 0x8000 if not.
In the next step we call the choose_kernel_location function from arch/x86/boot/compressed/aslr.c. As we can understand from the function name, it chooses the memory location where the kernel image will be decompressed. Let's look at this function.
At the start, choose_kernel_location tries to find the kaslr option in the command line if CONFIG_HIBERNATION is set, and the nokaslr option if CONFIG_HIBERNATION is not set:
#ifdef CONFIG_HIBERNATION
	if (!cmdline_find_option_bool("kaslr")) {
		debug_putstr("KASLR disabled by default...\n");
		goto out;
	}
#else
	if (cmdline_find_option_bool("nokaslr")) {
		debug_putstr("KASLR disabled by cmdline...\n");
		goto out;
	}
#endif
If there is no kaslr option (with CONFIG_HIBERNATION) or there is a nokaslr option (without it) in the command line, it jumps to the out label:
out:
	return (unsigned char *)choice;
which just returns the output parameter that we passed to choose_kernel_location, without any changes. Let's try to understand what kaslr is. We can find information about it in the documentation:
kaslr/nokaslr [X86]
Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
It means that we can pass the kaslr option to the kernel's command line and get a random address for the decompressed kernel (you can read more about ASLR in the links at the end of this part).
Let's consider the case when the kernel's command line contains the kaslr option.
There is a call of the mem_avoid_init function from the same aslr.c source code file. This function collects the unsafe memory regions (initrd, kernel command line, etc.). We need to know about these memory regions so that the kernel does not overlap them after decompression. For example:
	initrd_start  = (u64)real_mode->ext_ramdisk_image << 32;
	initrd_start |= real_mode->hdr.ramdisk_image;
	initrd_size   = (u64)real_mode->ext_ramdisk_size << 32;
	initrd_size  |= real_mode->hdr.ramdisk_size;
	mem_avoid[1].start = initrd_start;
	mem_avoid[1].size  = initrd_size;
Here we can see the calculation of the initrd start address and size. ext_ramdisk_image is the high 32 bits of the ramdisk_image field from the boot header and ext_ramdisk_size is the high 32 bits of the ramdisk_size field from the boot protocol:
Offset  Proto   Name            Meaning
/Size
...
...
...
0218/4  2.00+   ramdisk_image   initrd load address (set by boot loader)
021C/4  2.00+   ramdisk_size    initrd size (set by boot loader)
...
And ext_ramdisk_image and ext_ramdisk_size you can find in Documentation/x86/zero-page.txt:
Offset  Proto   Name               Meaning
/Size
...
...
...
0C0/004 ALL     ext_ramdisk_image  ramdisk_image high 32bits
0C4/004 ALL     ext_ramdisk_size   ramdisk_size high 32bits
...
So we take ext_ramdisk_image and ext_ramdisk_size, shift them left by 32 (now they occupy the high 32 bits of a 64-bit value), OR in the low 32 bits from ramdisk_image and ramdisk_size, and get the start address of the initrd and its size. After this we store these values in the mem_avoid array, which is defined as:
#define MEM_AVOID_MAX 5
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
where the mem_vector structure is:
struct mem_vector {
	unsigned long start;
	unsigned long size;
};
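The way the two 32-bit halves are combined into one mem_avoid entry can be sketched in Python. MemVector, combine_u64 and initrd_region are names invented for this illustration, and the example field values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class MemVector:
    """Python stand-in for struct mem_vector."""
    start: int
    size: int


def combine_u64(high32: int, low32: int) -> int:
    """Build a 64-bit value from its high and low 32-bit halves."""
    return ((high32 & 0xFFFFFFFF) << 32) | (low32 & 0xFFFFFFFF)


def initrd_region(ext_ramdisk_image, ramdisk_image,
                  ext_ramdisk_size, ramdisk_size) -> MemVector:
    """Mirror the initrd_start / initrd_size calculation shown above."""
    return MemVector(start=combine_u64(ext_ramdisk_image, ramdisk_image),
                     size=combine_u64(ext_ramdisk_size, ramdisk_size))


# Arbitrary example: an initrd loaded just below 6 GB, 32 MB in size.
region = initrd_region(0x1, 0x7f000000, 0x0, 0x2000000)
```

Here region.start is 0x17f000000, an address that would not fit into the 32-bit ramdisk_image field alone.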
The next step, after we have collected all unsafe memory regions in the mem_avoid array, is the search for a random address which does not overlap the unsafe regions, with the find_random_addr function.
First of all we can see the alignment of the output address in the find_random_addr function:
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
You may remember the CONFIG_PHYSICAL_ALIGN configuration option from the previous part. This option provides the value to which the kernel should be aligned, and it is 0x200000 by default. After we get the aligned output address, we go through the memory and collect regions which are good for the decompressed kernel image:
	for (i = 0; i < real_mode->e820_entries; i++) {
		process_e820_entry(&real_mode->e820_map[i], minimum, size);
	}
You may remember that we collected e820_entries in the second part of the Kernel booting process.
First of all the process_e820_entry function checks that the e820 memory region is RAM, that the start address of the memory region is not bigger than the maximum allowed ASLR offset and that the memory region does not end below the minimum load location:
	struct mem_vector region, img;
	if (entry->type != E820_RAM)
		return;
	if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
		return;
	if (entry->addr + entry->size < minimum)
		return;
After this, we store the e820 memory region start address and size in the mem_vector structure (we saw the definition of this structure above):
	region.start = entry->addr;
	region.size = entry->size;
Having stored these values, we align region.start as we did in the find_random_addr function and check that we didn't get an address that is beyond the original memory region:
	region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
	if (region.start > entry->addr + entry->size)
		return;
Next we subtract the difference between the aligned and the original address from the region size. If the last address in the memory region is bigger than CONFIG_RANDOMIZE_BASE_MAX_OFFSET, we also reduce the memory region size so that the end of the kernel image will be below the maximum ASLR offset:
	region.size -= region.start - entry->addr;
	if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
		region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
In the end we go through every aligned candidate position inside the region and check that the kernel image placed there does not overlap the unsafe areas (kernel command line, initrd, etc.):
	for (img.start = region.start, img.size = image_size ;
	     mem_contains(&region, &img) ;
	     img.start += CONFIG_PHYSICAL_ALIGN) {
		if (mem_avoid_overlap(&img))
			continue;
		slots_append(img.start);
	}
If the candidate does not overlap the unsafe regions, we call the slots_append function with its start address. The slots_append function just collects start addresses of memory regions into the slots array:
	slots[slot_max++] = addr;
which is defined as:
static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
			   CONFIG_PHYSICAL_ALIGN];
static unsigned long slot_max;
After process_e820_entry has been executed for every entry, we will have an array of the addresses which are safe for the decompressed kernel. Next we call the slots_fetch_random function to get a random item from this array:
	if (slot_max == 0)
		return 0;
	return slots[get_random_long() % slot_max];
where the get_random_long function checks different CPU flags, such as X86_FEATURE_RDRAND or X86_FEATURE_TSC, and chooses a method for getting a random number (it can be obtained with the RDRAND instruction, the time stamp counter, the programmable interval timer, etc.). Once we have the random address, the execution of choose_kernel_location is finished.
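The slot collection described above can be condensed into a small Python model. It is a sketch under assumptions, not the kernel implementation: collect_slots, overlaps and the tuple layout of the e820 entries are invented here, and raising the region start to minimum is an assumption that the excerpts above do not show.

```python
def align_up(value: int, align: int) -> int:
    """Round value up to a multiple of align (a power of two)."""
    return (value + align - 1) & ~(align - 1)


def overlaps(a_start, a_size, b_start, b_size) -> bool:
    """True if the two [start, start + size) ranges intersect."""
    return a_start < b_start + b_size and b_start < a_start + a_size


def collect_slots(e820, avoid, minimum, image_size, align, max_offset):
    """Return every aligned start address where the image fits safely.

    e820  - list of (addr, size, is_ram) tuples
    avoid - list of (start, size) unsafe regions (initrd, cmdline, ...)
    """
    slots = []
    for addr, size, is_ram in e820:
        if not is_ram or addr >= max_offset or addr + size < minimum:
            continue
        # Assumption: candidates never start below `minimum`.
        start = align_up(max(addr, minimum), align)
        end = min(addr + size, max_offset)
        while start + image_size <= end:   # image must fit in the region
            if not any(overlaps(start, image_size, s, n) for s, n in avoid):
                slots.append(start)
            start += align
    return slots


e820 = [(0x0, 0x9fc00, True), (0x100000, 0x1000000, True),
        (0x2000000, 0x100000, False)]
avoid = [(0x600000, 0x200000)]
slots = collect_slots(e820, avoid, minimum=0x200000, image_size=0x400000,
                      align=0x200000, max_offset=0x40000000)
```

For this made-up memory map the candidates at 0x400000 and 0x600000 are rejected because a 4 MB image placed there would overlap the avoided region; picking slots[r % len(slots)] for a random r then corresponds to slots_fetch_random.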
Now let's get back to misc.c. After we have the address for the kernel image, some checks are needed to be sure that the random address we got is correctly aligned and not otherwise wrong.
After all these checks we will see the familiar message:
Decompressing Linux...
and call the decompress function, which will decompress the kernel. The decompress function depends on what decompression algorithm was chosen during kernel compilation:
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif
#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif
#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif
#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif
#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif
After the kernel has been decompressed, the last function, handle_relocations, will relocate the kernel to the address that we got from choose_kernel_location. After the kernel is relocated we return from decompress_kernel to head_64.S. The address of the kernel will be in the rax register and we jump to it:
	jmp *%rax
That's all. Now we are in the kernel!

Conclusion

This is the end of the fifth and the last part about the Linux kernel booting process. We will not see posts about kernel booting anymore (maybe only updates to this and previous posts), but there will be many posts about other kernel internals.
The next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.

Links

* address space layout randomization
* initrd
* long mode
* bzip2
* RDRAND instruction
* Time Stamp Counter
* Programmable Interval Timers
* Previous part

Kernel initialization process

You will find here a couple of posts which describe the full cycle of kernel initialization, from its first steps after the kernel has been decompressed to the start of the first process run by the kernel itself.
Note that there will not be a description of all the kernel initialization steps. Here will be only the generic kernel part, without interrupt handling, ACPI and many other parts. All parts which I miss will be described in other chapters.
* First steps after kernel decompression - describes the first steps in the kernel.
* Early interrupt and exception handling - describes early interrupts initialization and the early page fault handler.
* Last preparations before the kernel entry point - describes the last preparations before the call of start_kernel.
* Kernel entry point - describes the first steps in the kernel generic code.
* Continue of architecture-specific initializations - describes architecture-specific initialization.
* Architecture-specific initializations, again... - describes the continuation of the architecture-specific initialization process.
* The End of the architecture-specific initializations, almost... - describes the end of the setup_arch related stuff.
* Scheduler initialization - describes the preparation before scheduler initialization and the initialization of it.
* RCU initialization - describes the initialization of the RCU.
* End of the initialization - the last part about Linux kernel initialization.

Kernel initialization. Part 1.

First steps in the kernel code

In the previous post (Kernel booting process. Part 5.) - Kernel decompression we stopped at the jump to the decompressed kernel:
	jmp *%rax
and now we are in the kernel. There are many things to do before the kernel will start the first init process. Hopefully we will see all of the preparations before the kernel starts in this big chapter. We will start from the kernel entry point, which is in arch/x86/kernel/head_64.S. We will see the first preparations like early page tables initialization, the switch to a new descriptor in kernel space and many more, before we see the start_kernel function from init/main.c being called.
So let's start.

First steps in the kernel

Okay, we got the address of the kernel from the decompress_kernel function into the rax register and just jumped there. The decompressed kernel code starts in arch/x86/kernel/head_64.S:
	__HEAD
	.code64
	.globl startup_64
startup_64:
	...
	...
	...
We can see the definition of the startup_64 routine; it is defined in the __HEAD section, which is just:
#define __HEAD .section ".head.text","ax"
We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:
.text : AT(ADDR(.text) - LOAD_OFFSET) {
	_text = .;
	...
	...
	...
} :text = 0x9090
We can understand the default virtual and physical addresses from the linker script. Note that the address of _text is the location counter, which is defined as:
. = __START_KERNEL;
for x86_64. We can find the definition of the __START_KERNEL macro in arch/x86/include/asm/page_types.h:
#define __START_KERNEL		(__START_KERNEL_map + __PHYSICAL_START)
#define __PHYSICAL_START	ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
Here we can see that __START_KERNEL is the sum of __START_KERNEL_map (which is 0xffffffff80000000, see the post about paging) and __PHYSICAL_START, where __PHYSICAL_START is the aligned value of CONFIG_PHYSICAL_START. So if you do not use kASLR and do not change CONFIG_PHYSICAL_START in the configuration, the addresses will be the following:
* Physical address - 0x1000000;
* Virtual address - 0xffffffff81000000.
Now we know the default physical and virtual addresses of the startup_64 routine, but to know the actual addresses we must calculate them with the following code:
	leaq _text(%rip), %rbp
	subq $_text - __START_KERNEL_map, %rbp
Here we just put the rip-relative address of _text into the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that the compiled address of _text is 0xffffffff81000000 and __START_KERNEL_map is 0xffffffff80000000, so the subtracted value is the default physical address of _text, 0x1000000. After this calculation rbp will contain the difference between the physical address where _text was actually loaded and 0x1000000, which is zero if the kernel was loaded at the default address. We need to calculate it because the kernel may not be running at the default address.
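The default addresses and the rbp calculation are easy to check with a few lines of Python. The constants are the values quoted in the text; align_up, load_delta and the example load address 0x3000000 are assumptions made for this sketch.

```python
def align_up(value: int, align: int) -> int:
    """Same rounding as the kernel ALIGN() macro for power-of-two alignments."""
    return (value + align - 1) & ~(align - 1)


START_KERNEL_MAP = 0xffffffff80000000
CONFIG_PHYSICAL_START = 0x1000000
CONFIG_PHYSICAL_ALIGN = 0x200000

PHYSICAL_START = align_up(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
START_KERNEL = START_KERNEL_MAP + PHYSICAL_START  # compiled address of _text


def load_delta(runtime_text_addr: int) -> int:
    """Model `leaq _text(%rip), %rbp; subq $_text - __START_KERNEL_map, %rbp`."""
    compiled_physical = START_KERNEL - START_KERNEL_MAP
    return runtime_text_addr - compiled_physical


default_delta = load_delta(0x1000000)    # loaded at the default address
moved_delta = load_delta(0x3000000)      # hypothetical relocated kernel
```

START_KERNEL comes out as 0xffffffff81000000, the delta is 0 for the default load address and 0x2000000 for the hypothetical relocated one.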
In the next step we check that this address is aligned with:
	movq %rbp, %rax
	andl $~PMD_PAGE_MASK, %eax
	testl %eax, %eax
	jnz bad_address
Here we just put the address into rax, keep only its low 21 bits with ~PMD_PAGE_MASK and test whether any of them is set. PMD_PAGE_MASK is the mask for the Page middle directory (read the paging post about it) and is defined as:
#define PMD_PAGE_MASK	(~(PMD_PAGE_SIZE-1))
#define PMD_PAGE_SIZE	(_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT	21
As we can easily calculate, PMD_PAGE_SIZE is 2 megabytes. Here we use the standard formula for checking alignment, and if the text address is not aligned to 2 megabytes, we jump to the bad_address label.
After this we check that the address is not too large:
	leaq _text(%rip), %rax
	shrq $MAX_PHYSMEM_BITS, %rax
	jnz bad_address
The address must fit in 46 bits:
#define MAX_PHYSMEM_BITS 46
Okay, we did some early checks and now we can move on.
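Both early checks boil down to simple bit arithmetic. The sketch below folds them into one hypothetical Python function, is_bad_address, that takes a single address for simplicity; in the assembly the first check is applied to rbp and the second to the rip-relative address of _text.

```python
PMD_SHIFT = 21
PMD_PAGE_SIZE = 1 << PMD_SHIFT          # 2 megabytes
PMD_PAGE_MASK = ~(PMD_PAGE_SIZE - 1)
MAX_PHYSMEM_BITS = 46


def is_bad_address(addr: int) -> bool:
    """Return True if either early check would jump to bad_address."""
    # andl $~PMD_PAGE_MASK, %eax: any of the low 21 bits set means
    # the address is not 2 MB aligned.
    misaligned = addr & ~PMD_PAGE_MASK & 0xFFFFFFFF
    # shrq $MAX_PHYSMEM_BITS, %rax: anything left means more than 46 bits.
    too_large = addr >> MAX_PHYSMEM_BITS
    return bool(misaligned or too_large)


ok = is_bad_address(0x1000000)       # 16 MB: aligned and small enough
odd = is_bad_address(0x1100000)      # 17 MB: not a multiple of 2 MB
```

Here ok is False and odd is True; an address of 1 << 46 would also be rejected, although it is perfectly aligned.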

Fix base addresses of page tables

The first step before we start to set up identity paging is to correct the following addresses:
	addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq %rbp, level3_kernel_pgt + (510*8)(%rip)
	addq %rbp, level3_kernel_pgt + (511*8)(%rip)
	addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
Here we need to correct the entries of early_level4_pgt and of the other page table directories because, as I wrote above, the kernel may not be running at the default 0x1000000 address. The rbp register contains the difference between the actual and the default load address, so we add it to the entries of early_level4_pgt, level3_kernel_pgt and level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definitions:
NEXT_PAGE(early_level4_pgt)
	.fill 511,8,0
	.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level3_kernel_pgt)
	.fill L3_START_KERNEL,8,0
	.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
		KERNEL_IMAGE_SIZE/PMD_SIZE)
NEXT_PAGE(level2_fixmap_pgt)
	.fill 506,8,0
	.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill 5,8,0
NEXT_PAGE(level1_fixmap_pgt)
	.fill 512,8,0
It looks hard, but it isn't.
First of all let's look at early_level4_pgt. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first 511 early_level4_pgt entries. After this we can see the level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map from it and add _PAGE_TABLE. As we know, __START_KERNEL_map is the base virtual address of the kernel text, so if we subtract __START_KERNEL_map, we get the physical address of level3_kernel_pgt. Now let's look at _PAGE_TABLE; it is just the page entry access rights:
#define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
			 _PAGE_ACCESSED | _PAGE_DIRTY)
You can read more about it in the paging post.
level3_kernel_pgt stores the entries which map kernel space. At the start of its definition we can see that it is filled with zeros L3_START_KERNEL times. Here L3_START_KERNEL is the index in the page upper directory which contains the __START_KERNEL_map address, and it equals 510. After it we can see the definition of two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. The first is simple: it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has the
#define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
			 _PAGE_DIRTY)
access rights. The second, level2_fixmap_pgt, covers virtual addresses which can refer to any physical addresses, even under kernel space.
Next, level2_kernel_pgt calls the PMDS macro, which maps 512 megabytes from __START_KERNEL_map for the kernel text (after these 512 megabytes will be the modules memory space).
Now let's get back to the code at the beginning of the section. Remember that rbp contains the difference between the actual and the default physical address of the _text section. We just add this value to the base addresses stored in the page table entries, so that they'll have correct addresses:
	addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq %rbp, level3_kernel_pgt + (510*8)(%rip)
	addq %rbp, level3_kernel_pgt + (511*8)(%rip)
	addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
The first line fixes the early_level4_pgt entry which points to level3_kernel_pgt, the second line fixes the entry which points to level2_kernel_pgt, the third line fixes the entry which points to level2_fixmap_pgt and the last line fixes the entry which points to level1_fixmap_pgt.
After all of this we will have:
early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[506] -> level1_fixmap_pgt
Now that we have corrected the base addresses of the page tables, we can start to build them.
Identity mapping setup
Now we can set up the identity mapping in the early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, 1:1. Let's look at it in detail. First of all we get the rip-relative addresses of _text and early_level4_pgt and put them into the rdi and rbx registers:
    leaq _text(%rip), %rdi
    leaq early_level4_pgt(%rip), %rbx
After this we store the physical address of _text in rax and get the index of the page global directory entry which covers the _text address, by shifting the address right by PGDIR_SHIFT:
    movq %rdi, %rax
    shrq $PGDIR_SHIFT, %rax
    leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
    movq %rdx, 0(%rbx,%rax,8)
    movq %rdx, 8(%rbx,%rax,8)
where PGDIR_SHIFT is 39. PGDIR_SHIFT is the position of the lowest page global directory index bit in a virtual address. There are such macros for all types of page directories:
    #define PGDIR_SHIFT 39
    #define PUD_SHIFT 30
    #define PMD_SHIFT 21
After this we put the address of the page that follows early_level4_pgt (4096 bytes past its start), combined with the _KERNPG_TABLE access rights (see above), into rdx and fill two consecutive early_level4_pgt entries with it.
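To make the index calculation above more concrete, here is a minimal C sketch of the same shift-and-mask step. The helper name table_index and the sample address 0xffffffff80000000 (assumed here to be the value of __START_KERNEL_map) are ours and not taken from the kernel; only the three shift values come from the text.

```c
#include <stdint.h>

/* Shift values quoted in the text above. */
#define PGDIR_SHIFT 39
#define PUD_SHIFT   30
#define PMD_SHIFT   21

/* Each x86_64 paging level holds 512 entries, so an index is 9 bits wide. */
#define PTRS_PER_TABLE 512

/* Hypothetical helper: index of `addr` in the table selected by `shift`. */
static inline unsigned table_index(uint64_t addr, unsigned shift)
{
    return (unsigned)((addr >> shift) & (PTRS_PER_TABLE - 1));
}
```

With the assumed __START_KERNEL_map value this gives index 511 at the page global directory level and 510 at the page upper directory level, which matches the early_level4_pgt[511] and level3_kernel_pgt[510] entries listed earlier.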
After this we add 4096 (the size of early_level4_pgt) to rdx, so that it refers to the next page, and put rdi (which contains the physical address of _text) into rax. Then we write two page upper directory entries into the page that follows early_level4_pgt:
    addq $4096, %rdx
    movq %rdi, %rax
    shrq $PUD_SHIFT, %rax
    andl $(PTRS_PER_PUD-1), %eax
    movq %rdx, 4096(%rbx,%rax,8)
    incl %eax
    andl $(PTRS_PER_PUD-1), %eax
    movq %rdx, 4096(%rbx,%rax,8)
In the next step we write the addresses of the page middle directory entries to level2_kernel_pgt, and the last step is correcting the kernel text+data virtual addresses:
        leaq level2_kernel_pgt(%rip), %rdi
        leaq 4096(%rdi), %r8
    1:  testq $1, 0(%rdi)
        jz 2f
        addq %rbp, 0(%rdi)
    2:  addq $8, %rdi
        cmp %r8, %rdi
        jne 1b
Here we put the address of level2_kernel_pgt into rdi and the address of the end of the table (4096 bytes further) into the r8 register. Next we check the present bit of the current level2_kernel_pgt entry: if it is set we add rbp to the entry, and if it is zero we skip the entry. In both cases we move to the next entry by adding 8 bytes to rdi. After this we compare rdi with r8 and either go back to label 1 or move forward.
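The loop above can be sketched in C as follows. This is only an illustration of the technique (walk a table, test bit 0, add a delta to present entries); the function name fixup_present_entries and the PAGE_PRESENT constant are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_PRESENT 0x1ULL  /* bit 0 of a page table entry */

/* Add `delta` to every present entry and skip empty ones,
   mirroring the testq / jz 2f / addq sequence. */
static inline void fixup_present_entries(uint64_t *table, size_t count,
                                         uint64_t delta)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i] & PAGE_PRESENT)
            table[i] += delta;
    }
}
```

In the assembly, table corresponds to level2_kernel_pgt, count to 512 entries and delta to the value in rbp.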
In the next step we correct the phys_base physical address with rbp (which contains the physical address of _text), put the physical address of early_level4_pgt into rax and jump to label 1:
    addq %rbp, phys_base(%rip)
    movq $(early_level4_pgt - __START_KERNEL_map), %rax
    jmp 1f
where phys_base matches the first entry of level2_kernel_pgt, which is the 512 MB kernel mapping.
Last preparations
After we jump to label 1 we enable PAE and PGE (Paging Global Extension), add phys_base (see above) to the address in the rax register and load the result into the cr3 register:
    1:
        movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
        movq %rcx, %cr4
        addq phys_base(%rip), %rax
        movq %rax, %cr3
In the next step we check that the CPU supports the NX bit with:
    movl $0x80000001, %eax
    cpuid
    movl %edx, %edi
We put the value 0x80000001 into eax and execute the cpuid instruction to get the extended processor info and feature bits. The result will be in the edx register, which we copy to edi.
Now we put 0xc0000080, or MSR_EFER, into ecx and execute the rdmsr instruction to read the model specific register:
    movl $MSR_EFER, %ecx
    rdmsr
The result will be in edx:eax. The general view of EFER is the following:
63                                                                              32
 --------------------------------------------------------------------------------
|                                                                               |
|                                Reserved MBZ                                   |
|                                                                               |
 --------------------------------------------------------------------------------
31                            16  15      14      13   12  11   10  9  8 7  1   0
 --------------------------------------------------------------------------------
|                              | T |       |       |    |   |   |   |   |   |   |
| Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |
 --------------------------------------------------------------------------------
We will not look at all the fields in detail here, but we will learn about this and other MSRs in a separate part. As we have read EFER into edx:eax, we set the _EFER_SCE bit (bit zero, System Call Extensions) with the btsl instruction. Setting the SCE bit enables the SYSCALL and SYSRET instructions. In the next step we check bit 20 in edi; remember that this register stores the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump to label 1 and just write EFER with the SCE bit to the model specific register:
        btsl $_EFER_SCE, %eax
        btl $20, %edi
        jnc 1f
        btsl $_EFER_NX, %eax
        btsq $_PAGE_BIT_NX, early_pmd_flags(%rip)
    1:  wrmsr
If the NX bit is supported, we also set _EFER_NX (and the NX bit in early_pmd_flags) before writing EFER with the wrmsr instruction.
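The decision made by the btsl/btl/jnc sequence can be written as a small C sketch. It only models the low 32 bits of EFER (the eax half) and assumes the bit positions shown in the EFER diagram above (SCE is bit 0, NXE is bit 11) and the NX feature flag in bit 20 of the cpuid result. The function and constant names are invented for this example.

```c
#include <stdint.h>

#define EFER_SCE_BIT     0   /* System Call Extensions */
#define EFER_NXE_BIT     11  /* No-Execute Enable */
#define CPUID_EDX_NX_BIT 20  /* NX flag in cpuid 0x80000001, edx */

/* Compute the low half of EFER that the early code writes back. */
static inline uint32_t early_efer_low(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << EFER_SCE_BIT;             /* btsl $_EFER_SCE, %eax */
    if (cpuid_edx & (1u << CPUID_EDX_NX_BIT))   /* btl $20, %edi; jnc 1f */
        efer_low |= 1u << EFER_NXE_BIT;         /* btsl $_EFER_NX, %eax */
    return efer_low;
}
```

The real code additionally sets the NX bit in early_pmd_flags in the same branch; that side effect is left out of the sketch.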
In the next step we need to update the Global Descriptor Table with the lgdt instruction:
    lgdt early_gdt_descr(%rip)
where the Global Descriptor Table is defined as:
    early_gdt_descr:
        .word GDT_ENTRIES*8-1
    early_gdt_descr_base:
        .quad INIT_PER_CPU_VAR(gdt_page)
We need to reload the Global Descriptor Table because the kernel now works in the low userspace addresses, but soon it will work in its own space. Now let's look at the early_gdt_descr definition. The Global Descriptor Table contains 32 entries:
    #define GDT_ENTRIES 32
for kernel code, data, thread local storage segments and so on; it's simple. Now let's look at early_gdt_descr_base. First of all, gdt_page is defined as:
    struct gdt_page {
        struct desc_struct gdt[GDT_ENTRIES];
    } __attribute__((aligned(PAGE_SIZE)));
in arch/x86/include/asm/desc.h. It contains one field, gdt, which is an array of desc_struct structures, defined as:
    struct desc_struct {
        union {
            struct {
                unsigned int a;
                unsigned int b;
            };
            struct {
                u16 limit0;
                u16 base0;
                unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
            };
        };
    } __attribute__((packed));
which represents the GDT descriptor familiar to us. Also note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes. This means that gdt will occupy one page. Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro defined in arch/x86/include/asm/percpu.h which just concatenates init_per_cpu__ with the given parameter:
    #define INIT_PER_CPU_VAR(var) init_per_cpu__##var
After this we have init_per_cpu__gdt_page. We can see in the linker script:
    #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
    INIT_PER_CPU(gdt_page);
INIT_PER_CPU_VAR gives us the name init_per_cpu__gdt_page, and when the INIT_PER_CPU macro from the linker script is expanded, that symbol is defined as the offset of gdt_page from __per_cpu_load. After these calculations we have the correct base address of the new GDT.
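The token pasting done by INIT_PER_CPU_VAR can be tried in ordinary C. In the sketch below the macro has the same shape as the one quoted above, while the variable init_per_cpu__gdt_page and its value are stand-ins invented for the example (in the kernel the symbol is produced by the linker script, not by C code).

```c
/* Same shape as the macro quoted above: paste a prefix onto the name. */
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* Stand-in for the symbol that the linker script would define. */
static const unsigned long init_per_cpu__gdt_page = 0x1000;

static inline unsigned long gdt_base(void)
{
    /* Expands to init_per_cpu__gdt_page at preprocessing time. */
    return INIT_PER_CPU_VAR(gdt_page);
}
```

Because the pasting happens in the preprocessor, the same macro works both in C sources and in the preprocessed assembly file where early_gdt_descr_base is defined.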
Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU will have its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type, for example there are no locks, because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own GDT table, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on that core. You can read in detail about per-CPU variables in the Theory/per-cpu post.
As we have loaded the new Global Descriptor Table, we reload the segment registers as we did before:
    xorl %eax, %eax
    movl %eax, %ds
    movl %eax, %ss
    movl %eax, %es
    movl %eax, %fs
    movl %eax, %gs
After all of these steps we set up the gs register so that it points to the irqstack (we will see information about it in the upcoming parts):
    movl $MSR_GS_BASE, %ecx
    movl initial_gs(%rip), %eax
    movl initial_gs+4(%rip), %edx
    wrmsr
where MSR_GS_BASE is:
    #define MSR_GS_BASE 0xc0000101
We need to put MSR_GS_BASE into the ecx register and write the data from eax and edx (which hold the low and high halves of initial_gs) with the wrmsr instruction. We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs), and this part contains a descriptor which is mapped to model specific registers. So we can see above that 0xc0000101 is the gs.base MSR address.
In the next step we put the address of the real mode bootparam structure into rdi (remember that rsi has held a pointer to this structure from the start) and jump to the C code with:
    movq initial_code(%rip), %rax
    pushq $0
    pushq $__KERNEL_CS
    pushq %rax
    lretq
Here we put the address stored in initial_code into rax and push a fake address, __KERNEL_CS and the address from initial_code onto the stack. After this we can see the lretq instruction, which means that the return address will be popped from the stack (it is now the address from initial_code) and execution will jump there. initial_code is defined in the same source code file and looks like this:
    __REFDATA
    .balign 8
    GLOBAL(initial_code)
    .quad x86_64_start_kernel
    ...
    ...
    ...
As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:
    asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
        ...
        ...
        ...
    }
It has one argument, real_mode_data (remember that we passed the address of the real mode data in the rdi register previously).
This is the first C code in the kernel!
Next to start_kernel
We need to see the last preparations before we can see the "kernel entry point", the start_kernel function from init/main.c.
First of all we can see some checks in the x86_64_start_kernel function:
    BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
    BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
    BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
    BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
    BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
    BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
    BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
    BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
There are checks for different things, for example that the virtual address of the modules space is not lower than the base address of the kernel text (__START_KERNEL_map), that the space between the kernel text base and the modules area is not smaller than the kernel image, and so on. BUILD_BUG_ON is a macro which looks like this:
    #define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
Let's try to understand how this trick works. Take the first condition as an example: MODULES_VADDR < __START_KERNEL_map. !!condition is the same as condition != 0. So if MODULES_VADDR < __START_KERNEL_map is true, we get 1 from !!(condition), or zero if not. After 2*!!(condition) we get either 2 or 0. At the end of the calculation we get one of two behaviors:
- We get a compilation error, because we try to take the size of a char array with a negative length (this is not our case, because MODULES_VADDR cannot be less than __START_KERNEL_map);
- No compilation errors.
That's all. It is an interesting C trick for getting a compile error which depends on some constants.
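The trick is easy to try outside the kernel. The sketch below repeats the macro as quoted above and adds a helper macro, BUG_ARRAY_LEN, which is our own name and exists only to expose the array length that the trick computes.

```c
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

/* Hypothetical helper: the array length used inside BUILD_BUG_ON. */
#define BUG_ARRAY_LEN(condition) (1 - 2*!!(condition))

static inline void check_layout(void)
{
    /* The condition is false, so this is sizeof(char[1]) and compiles. */
    BUILD_BUG_ON(sizeof(char) != 1);

    /* A true condition, e.g. BUILD_BUG_ON(sizeof(char) == 1), would
       become sizeof(char[-1]) and be rejected by the compiler. */
}
```

Note that the condition has to be a compile-time constant for the check to be meaningful; with a runtime value the array would turn into a variable length array and the error would not be reported at build time.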
In the next step we can see the call of the cr4_init_shadow function, which stores a shadow copy of cr4 per CPU. Context switches can change bits in cr4, so we need to store cr4 for each CPU. After this we can see the call of the reset_early_page_tables function, where we reset page global directory entries and write a new pointer to the PGT into cr3:
    for (i = 0; i < PTRS_PER_PGD-1; i++)
        early_level4_pgt[i].pgd = 0;
    next_early_pgt = 0;
    write_cr3(__pa_nodebug(early_level4_pgt));
Soon we will build new page tables. Here we can see that we go through all Page Global Directory entries except the last one (PTRS_PER_PGD is 512) in the loop and set them to zero. After this we set next_early_pgt to zero (we will see details about it in the next post) and write the physical address of early_level4_pgt to cr3. __pa_nodebug is a macro which expands to:
    ((unsigned long)(x) - __START_KERNEL_map + phys_base)
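As a worked example of this expression, here is a small C sketch. The function name pa_nodebug and the constant START_KERNEL_MAP are ours; the value 0xffffffff80000000 is an assumption for __START_KERNEL_map on x86_64, and phys_base is passed as a parameter instead of being read from a kernel variable.

```c
#include <stdint.h>

/* Assumed value of __START_KERNEL_map on x86_64. */
#define START_KERNEL_MAP 0xffffffff80000000ULL

/* Translate a kernel text mapping address to a physical address. */
static inline uint64_t pa_nodebug(uint64_t vaddr, uint64_t phys_base)
{
    return vaddr - START_KERNEL_MAP + phys_base;
}
```

For example, with phys_base equal to zero the virtual address 0xffffffff81000000 translates to the physical address 0x1000000 (16 megabytes).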
After this we clear _bss from __bss_start to __bss_stop, and the next step will be the setup of the early IDT handlers, but it is a big topic so we will see it in the next part.
Conclusion
This is the end of the first part about linux kernel initialization.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
In the next part we will see initialization of the early interrupt handlers, kernel space memory mapping and a lot more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
- Model Specific Register
- Paging
- Previous part - Kernel decompression
- NX
- ASLR
Kernel initialization. Part 2.
Early interrupt and exception handling
In the previous part we stopped before the setting of early interrupt handlers. We continue in this part and will learn more about interrupt and exception handling.
Remember that we stopped before the following loop:
    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
        set_intr_gate(i, early_idt_handlers[i]);
from the arch/x86/kernel/head64.c source code file. But before we start to sort out this code, we need to know about interrupts and handlers.
Some theory
An interrupt is an event caused by software or hardware and delivered to the CPU. On an interrupt, the CPU stops the current task and transfers control to the interrupt handler, which handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:
- Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
- Hardware interrupts - when a hardware event happens, for example a button is pressed on a keyboard;
- Exceptions - interrupts generated by the CPU when it detects an error, for example division by zero or accessing a memory page which is not in RAM.
Every interrupt and exception is assigned a unique number, called the vector number. A vector number can be any number from 0 to 255. It is common practice to use the first 32 vector numbers for exceptions, while vector numbers from 32 to 255 are used for user-defined interrupts. We can see it in the code above as NUM_EXCEPTION_VECTORS, which is defined as:
    #define NUM_EXCEPTION_VECTORS 32
The CPU uses the vector number as an index in the Interrupt Descriptor Table (we will see a description of it soon). The CPU catches interrupts from the APIC or through its pins. The following table shows exceptions 0-31:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|0     |#DE     |Divide Error        |Fault|NO        |DIV and IDIV                          |
|1     |#DB     |Debug               |F/T  |NO        |                                      |
|2     |---     |NMI                 |INT  |NO        |external NMI                          |
|3     |#BP     |Breakpoint          |Trap |NO        |INT 3                                 |
|4     |#OF     |Overflow            |Trap |NO        |INTO instruction                      |
|5     |#BR     |Bound Range Exceeded|Fault|NO        |BOUND instruction                     |
|6     |#UD     |Invalid Opcode      |Fault|NO        |UD2 instruction                       |
|7     |#NM     |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
|8     |#DF     |Double Fault        |Abort|YES       |Any instruction which can generate NMI|
|9     |---     |Reserved            |Fault|NO        |                                      |
|10    |#TS     |Invalid TSS         |Fault|YES       |Task switch or TSS access             |
|11    |#NP     |Segment Not Present |Fault|YES       |Accessing segment register            |
|12    |#SS     |Stack-Segment Fault |Fault|YES       |Stack operations                      |
|13    |#GP     |General Protection  |Fault|YES       |Memory reference                      |
|14    |#PF     |Page fault          |Fault|YES       |Memory reference                      |
|15    |---     |Reserved            |     |NO        |                                      |
|16    |#MF     |x87 FPU fp error    |Fault|NO        |Floating point or [F]WAIT             |
|17    |#AC     |Alignment Check     |Fault|YES       |Data reference                        |
|18    |#MC     |Machine Check       |Abort|NO        |                                      |
|19    |#XM     |SIMD fp exception   |Fault|NO        |SSE[2,3] instructions                 |
|20    |#VE     |Virtualization exc. |Fault|NO        |EPT violations                        |
|21-31 |---     |Reserved            |INT  |NO        |External interrupts                   |
----------------------------------------------------------------------------------------------
To react to an interrupt, the CPU uses a special structure, the Interrupt Descriptor Table or IDT. The IDT is an array of 8-byte descriptors like the Global Descriptor Table, but IDT entries are called gates. The CPU multiplies the vector number by 8 to find the offset of the IDT entry. But in 64-bit mode the IDT is an array of 16-byte descriptors and the CPU multiplies the vector number by 16 to find the offset of the entry in the IDT. We remember from the previous part that the CPU uses the special GDTR register to locate the Global Descriptor Table; in the same way the CPU uses the special IDTR register for the Interrupt Descriptor Table and the lidt instruction for loading the base address of the table into this register.
A 64-bit mode IDT entry has the following structure:
127                                                                             96
 --------------------------------------------------------------------------------
|                                                                               |
|                                Reserved                                       |
|                                                                               |
 --------------------------------------------------------------------------------
95                                                                              64
 --------------------------------------------------------------------------------
|                                                                               |
|                               Offset 63..32                                   |
|                                                                               |
 --------------------------------------------------------------------------------
63                               48 47      46  44   42    39             34    32
 --------------------------------------------------------------------------------
|                                  |       |  D  |   |     |      |   |   |     |
|       Offset 31..16              |   P   |  P  | 0 |Type |0 0 0 | 0 | 0 | IST |
|                                  |       |  L  |   |     |      |   |   |     |
 --------------------------------------------------------------------------------
31                                   16 15                                      0
 --------------------------------------------------------------------------------
|                                      |                                        |
|          Segment Selector            |                 Offset 15..0           |
|                                      |                                        |
 --------------------------------------------------------------------------------
Where:
- Offset - the offset to the entry point of an interrupt handler;
- DPL - Descriptor Privilege Level;
- P - Segment Present flag;
- Segment selector - a code segment selector in the GDT or LDT;
- IST - provides the ability to switch to a new stack for interrupt handling.
And the last field, Type, describes the type of the IDT entry. There are three different kinds of handlers for interrupts:
- Task descriptor
- Interrupt descriptor
- Trap descriptor
Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how the CPU handles the IF flag. If the interrupt handler was accessed through an interrupt gate, the CPU clears the IF flag to prevent other interrupts while the current interrupt handler executes. After the current interrupt handler has executed, the CPU sets the IF flag again with the iret instruction.
Other bits are reserved and must be 0.
Now let's look at how the CPU handles interrupts:
- The CPU saves the flags register, CS, and the instruction pointer on the stack;
- If the interrupt causes an error code (like #PF for example), the CPU saves the error code on the stack after the instruction pointer;
- After the interrupt handler has executed, the iret instruction is used to return from it.
Now let's get back to the code.
Fill and load IDT
We stopped at the following point:
    for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
        set_intr_gate(i, early_idt_handlers[i]);
Here we call set_intr_gate in the loop, which takes two parameters:
- the number of an interrupt;
- the address of the idt handler.
and inserts an interrupt gate in the nth IDT entry. First of all let's look at early_idt_handlers. It is an array which contains the addresses of the first 32 interrupt handlers:
    extern const char early_idt_handlers[NUM_EXCEPTION_VECTORS][2+2+5];
We're filling only the first 32 IDT entries because all of the early setup runs with interrupts disabled, so there is no need to set up early exception handlers for vectors greater than 31. early_idt_handlers contains generic idt handlers and we can find it in arch/x86/kernel/head_64.S; we will look at it soon.
Now let's look at the set_intr_gate implementation:
    #define set_intr_gate(n, addr)                                      \
        do {                                                            \
            BUG_ON((unsigned)n > 0xFF);                                 \
            _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,            \
                      __KERNEL_CS);                                     \
            _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,    \
                            0, 0, __KERNEL_CS);                         \
        } while (0)
First of all it checks with the BUG_ON macro that the passed interrupt number is not greater than 255. We need this check because we can have only 256 interrupts. After this it calls _set_gate, which writes the interrupt gate to the IDT:
    static inline void _set_gate(int gate, unsigned type, void *addr,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate_desc s;
        pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
        write_idt_entry(idt_table, gate, &s);
        write_trace_idt_entry(gate, &s);
    }
At the start of the _set_gate function we can see the call of the pack_gate function, which fills a gate_desc structure with the given values:
    static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate->offset_low = PTR_LOW(func);
        gate->segment = __KERNEL_CS;
        gate->ist = ist;
        gate->p = 1;
        gate->dpl = dpl;
        gate->zero0 = 0;
        gate->zero1 = 0;
        gate->type = type;
        gate->offset_middle = PTR_MIDDLE(func);
        gate->offset_high = PTR_HIGH(func);
    }
As mentioned above, we fill the gate descriptor in this function. We fill three parts of the address of the interrupt handler with the address which we got in the main loop (the address of the interrupt handler entry point). We use the three following macros to split the address into three parts:
    #define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
    #define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
    #define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
With the first macro, PTR_LOW, we get the lowest 2 bytes of the address, with the second, PTR_MIDDLE, the next 2 bytes, and with the third, PTR_HIGH, the highest 4 bytes. Next we set up the segment selector for the interrupt handler; it will be our kernel code segment, __KERNEL_CS. In the next step we fill the Interrupt Stack Table and the Descriptor Privilege Level (highest privilege level) with zeros. And we set the GATE_INTERRUPT type at the end.
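The following C sketch shows how the three macros split a handler address and how the parts fit back together. The struct gate_offsets type and the two helper functions are simplified stand-ins written for this example; they are not the kernel's gate_desc, which contains more fields.

```c
#include <stdint.h>

#define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

/* Simplified stand-in for the three offset fields of a gate. */
struct gate_offsets {
    uint16_t offset_low;     /* bits 15..0  */
    uint16_t offset_middle;  /* bits 31..16 */
    uint32_t offset_high;    /* bits 63..32 */
};

static inline struct gate_offsets split_handler(uint64_t func)
{
    struct gate_offsets g;
    g.offset_low    = (uint16_t)PTR_LOW(func);
    g.offset_middle = (uint16_t)PTR_MIDDLE(func);
    g.offset_high   = (uint32_t)PTR_HIGH(func);
    return g;
}

/* Reassemble the address the way the CPU reads it from the gate. */
static inline uint64_t join_handler(struct gate_offsets g)
{
    return ((uint64_t)g.offset_high << 32) |
           ((uint64_t)g.offset_middle << 16) |
           g.offset_low;
}
```

For a handler at the made-up address 0xffffffff81234567 this yields 0x4567, 0x8123 and 0xffffffff for the low, middle and high parts.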
Now we have a filled IDT entry and we can call the native_write_idt_entry function, which just copies the filled IDT entry to the IDT:
    static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
    {
        memcpy(&idt[entry], gate, sizeof(*gate));
    }
After the main loop has finished, we will have a filled idt_table array of gate_desc structures and we can load the IDT with:
    load_idt((const struct desc_ptr *)&idt_descr);
where idt_descr is:
    struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
and load_idt just executes the lidt instruction:
    asm volatile("lidt %0"::"m" (*dtr));
You can note that there are calls of the _trace_* functions in _set_gate and other functions. These functions fill IDT gates in the same manner as _set_gate, with one difference: they use the trace_idt_table Interrupt Descriptor Table instead of idt_table, for tracepoints (we will cover this topic in another part).
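The first field of idt_descr above is the table limit, and the CPU finds a gate by multiplying the vector number by the entry size. A minimal sketch of this arithmetic, assuming NR_VECTORS is 256 (the text only says that there can be 256 interrupts) and using helper names of our own:

```c
#include <stddef.h>

#define NR_VECTORS     256  /* assumed: the text allows 256 interrupts */
#define IDT_ENTRY_SIZE 16   /* bytes per gate in 64-bit mode */

/* Byte offset of the gate for `vector` from the start of the IDT. */
static inline size_t idt_entry_offset(unsigned vector)
{
    return (size_t)vector * IDT_ENTRY_SIZE;
}

/* Limit stored in the descriptor: the last valid byte offset. */
static inline size_t idt_limit(void)
{
    return (size_t)NR_VECTORS * IDT_ENTRY_SIZE - 1;
}
```

Under these assumptions the whole table is 4096 bytes, so the limit is 4095, and the page fault gate (vector 14) starts at byte offset 224.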
Okay, now we have filled and loaded the Interrupt Descriptor Table, and we know how the CPU acts during an interrupt. So now it is time to deal with the interrupt handlers.
Early interrupts handlers
As you can read above, we filled the IDT with the addresses of early_idt_handlers. We can find it in arch/x86/kernel/head_64.S:
        .globl early_idt_handlers
    early_idt_handlers:
        i = 0
        .rept NUM_EXCEPTION_VECTORS
        .if (EXCEPTION_ERRCODE_MASK >> i) & 1
        ASM_NOP2
        .else
        pushq $0
        .endif
        pushq $i
        jmp early_idt_handler
        i = i + 1
        .endr
We can see here the generation of interrupt handlers for the first 32 exceptions. We check whether the exception has an error code: if it does, we do nothing; if it does not, we push zero onto the stack. We do this so that the stack layout is uniform. After that we push the exception number onto the stack and jump to early_idt_handler, which is the generic interrupt handler for now. As I wrote above, the CPU pushes the flags register, CS and RIP onto the stack. So before early_idt_handler is executed, the stack will contain the following data:
    |--------------------|
    | %rflags            |
    | %cs                |
    | %rip               |
    | rsp --> error code |
    |--------------------|
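The .if test in the stub generator above is a plain bit test on a mask. It can be sketched in C as follows; the mask value 0x00027d00 is an assumption made for this example (it has bits 8, 10 to 14 and 17 set, the vectors that push an error code), and the function name is ours.

```c
/* Assumed mask: one bit per vector that pushes an error code
   (vectors 8, 10, 11, 12, 13, 14 and 17). */
#define EXCEPTION_ERRCODE_MASK 0x00027d00u

/* Returns 1 if the CPU itself pushes an error code for `vector`,
   so the stub must not push a dummy zero. */
static inline int pushes_error_code(unsigned vector)
{
    return vector < 32 && ((EXCEPTION_ERRCODE_MASK >> vector) & 1);
}
```

For vectors where this returns 0 the stub pushes $0 itself, so that early_idt_handler always finds an error code slot below the vector number.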
Now let's look at the early_idt_handler implementation. It is located in the same arch/x86/kernel/head_64.S. First of all we can see the check for NMI; we don't need to handle it, so we just ignore it in early_idt_handler:
    cmpl $2, (%rsp)
    je is_nmi
where is_nmi is:
    is_nmi:
        addq $16, %rsp
        INTERRUPT_RETURN
We drop the error code and vector number from the stack and call INTERRUPT_RETURN, which is just iretq. If we have checked the vector number and it is not an NMI, we check early_recursion_flag to prevent recursion in early_idt_handler, and if everything is correct we save the general registers on the stack:
    pushq %rax
    pushq %rcx
    pushq %rdx
    pushq %rsi
    pushq %rdi
    pushq %r8
    pushq %r9
    pushq %r10
    pushq %r11
We need to do this to prevent wrong values in these registers when we return from the interrupt handler. After this we check the segment selector on the stack:
    cmpl $__KERNEL_CS, 96(%rsp)
    jne 11f
It must be equal to the kernel code segment, and if it is not we jump to label 11, which prints a PANIC message and makes a stack dump.
After the code segment has been checked, we check the vector number, and if it is #PF we put the value from cr2 into the rdi register and call early_make_pgtable (we'll see it soon):
    cmpl $14, 72(%rsp)
    jnz 10f
    GET_CR2_INTO(%rdi)
    call early_make_pgtable
    andl %eax, %eax
    jz 20f
If early_make_pgtable returns zero (the jz 20f above), the page fault has been handled and we restore the general purpose registers from the stack:
    popq %r11
    popq %r10
    popq %r9
    popq %r8
    popq %rdi
    popq %rsi
    popq %rdx
    popq %rcx
    popq %rax
and exit from the handler with iret.
This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it handles only Page Fault for now. We will see handlers for the other interrupts later, but now let's look at the page fault handler.
Page fault handling
In the previous paragraph we saw the first early interrupt handler, which checks whether the interrupt number is a page fault and calls early_make_pgtable to build new page tables if it is. We need to have a #PF handler at this step because there are plans to add the ability to load the kernel above 4G and to access the boot_params structure above 4G.
You can find the implementation of early_make_pgtable in arch/x86/kernel/head64.c. It takes one parameter: the address from the cr2 register which caused the Page Fault. Let's look at it:
    int __init early_make_pgtable(unsigned long address)
    {
        unsigned long physaddr = address - __PAGE_OFFSET;
        unsigned long i;
        pgdval_t pgd, *pgd_p;
        pudval_t pud, *pud_p;
        pmdval_t pmd, *pmd_p;
        ...
        ...
        ...
    }
It starts with the definition of some variables which have *val_t types. All of these types are just:
    typedef unsigned long pgdval_t;
We will also operate with the *_t (not val) types, for example pgd_t. All of these types are defined in arch/x86/include/asm/pgtable_types.h and represent structures like this:
    typedef struct { pgdval_t pgd; } pgd_t;
For example,
    extern pgd_t early_level4_pgt[PTRS_PER_PGD];
Here early_level4_pgt represents the early top-level page table directory, which consists of an array of pgd_t types, and pgd points to lower-level page entries.
After we have checked that the address is not invalid, we get the address of the Page Global Directory entry which covers the #PF address and put its value into the pgd variable:
    pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
    pgd = *pgd_p;
In the next step we check pgd. If it contains a correct page global directory entry, we take the physical address stored in it, convert it to a virtual address and put it into pud_p with:
    pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
where PTE_PFN_MASK is a macro:
    #define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
which expands to:
    (~(PAGE_SIZE-1)) & ((1 << 46) - 1)
or
    0b1111111111111111111111111111111111000000000000
which keeps bits 12 through 45, the part of a 46-bit physical address that holds the page frame.
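The value of the mask can be checked with a few lines of C. In this sketch PHYS_BITS and the two helper functions are names chosen for the example, and the shift is done on a 64-bit constant (1ULL), because shifting a plain int by 46 bits would overflow in ordinary C.

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PHYS_BITS 46      /* width of the physical address in the text */

/* Bits 12..45 set: the page frame part of a page table entry. */
static inline uint64_t pte_pfn_mask(void)
{
    return ~(PAGE_SIZE - 1) & ((1ULL << PHYS_BITS) - 1);
}

/* Strip the flag bits from an entry and keep the physical address. */
static inline uint64_t entry_to_phys(uint64_t entry)
{
    return entry & pte_pfn_mask();
}
```

For example, an entry with the made-up value 0x1234063 (a page at 0x1234000 plus some flag bits) gives back 0x1234000.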
If pgd does not contain a correct address, we check that next_early_pgt has not reached EARLY_DYNAMIC_PAGE_TABLES, which is 64 and represents a fixed number of buffers for setting up new page tables on demand. If next_early_pgt is greater than or equal to EARLY_DYNAMIC_PAGE_TABLES, we reset the page tables and start again. If next_early_pgt is less than EARLY_DYNAMIC_PAGE_TABLES, we create a new page upper directory pointer which points to the current dynamic page table, and write its physical address with the _KERNPG_TABLE access rights to the page global directory:
    if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
        reset_early_page_tables();
        goto again;
    }
    pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
    for (i = 0; i < PTRS_PER_PUD; i++)
        pud_p[i] = 0;
    *pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
After this we fix up the address of the page upper directory with:
    pud_p += pud_index(address);
    pud = *pud_p;
In the next step we do the same actions as before, but with the page middle directory. At the end we fix the address of the page middle directory entry which maps the kernel text+data virtual addresses:
    pmd = (physaddr & PMD_MASK) + early_pmd_flags;
    pmd_p[pmd_index(address)] = pmd;
After the page fault handler has finished its work, our early_level4_pgt contains entries which point to valid addresses.
Conclusion
This is the end of the second part about linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all the steps before the kernel entry point, the start_kernel function.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
- GNU assembly .rept
- APIC
- NMI
- Previous part
Kernel initialization. Part 3.
Last preparations before the kernel entry point
This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and we will continue to dive into the linux kernel initialization process in the current part. Our next point is the 'kernel entry point', the start_kernel function from the init/main.c source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we see the call of the start_kernel function, we must do some preparations. So let's continue.
boot_params again
In the previous part we stopped at setting the Interrupt Descriptor Table and loading it into the IDTR register. The next step after this is a call of the copy_bootdata function:
    copy_bootdata(__va(real_mode_data));
This function takes one argument, the virtual address of real_mode_data. Remember that we passed the address of the boot_params structure from arch/x86/include/uapi/asm/bootparam.h to the x86_64_start_kernel function as the first argument in arch/x86/kernel/head_64.S:
    /* rsi is pointer to real mode structure with interesting info.
       pass it to C */
    movq %rsi, %rdi
Now let's look at the __va macro. This macro is defined in arch/x86/include/asm/page.h:
    #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
where PAGE_OFFSET is __PAGE_OFFSET, which is 0xffff880000000000, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the boot_params structure and pass it to the copy_bootdata function, where we copy real_mode_data to boot_params, which is declared in arch/x86/include/asm/setup.h:
    extern struct boot_params boot_params;
Let's look at the copy_bootdata implementation:
    static void __init copy_bootdata(char *real_mode_data)
    {
        char *command_line;
        unsigned long cmd_line_ptr;
        memcpy(&boot_params, real_mode_data, sizeof boot_params);
        sanitize_boot_params(&boot_params);
        cmd_line_ptr = get_cmd_line_ptr();
        if (cmd_line_ptr) {
            command_line = __va(cmd_line_ptr);
            memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
        }
    }
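The __va translation used in copy_bootdata is a plain addition. The sketch below shows it together with the reverse direction; the function names are ours, and the addresses are handled as 64-bit integers instead of pointers so that the example can run in user space.

```c
#include <stdint.h>

/* Base of the direct mapping of physical memory, as given in the text. */
#define PAGE_OFFSET 0xffff880000000000ULL

/* Physical address -> address in the direct mapping (what __va does). */
static inline uint64_t direct_map_va(uint64_t paddr)
{
    return paddr + PAGE_OFFSET;
}

/* Reverse translation, valid only for direct mapping addresses. */
static inline uint64_t direct_map_pa(uint64_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}
```

For instance, a boot_params structure placed by a boot loader at the made-up physical address 0x90000 would be accessed at 0xffff880000090000.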
First of all, note that this function is declared with the __init prefix. It means that this function will be used only during initialization and the memory it uses will be freed afterwards.
We can see the declaration of two variables for the kernel command line and the copying of real_mode_data to boot_params with the memcpy function. Next comes the call of the sanitize_boot_params function, which zeroes some fields of the boot_params structure, like ext_ramdisk_image, for boot loaders which fail to initialize unknown fields in boot_params to zero. After this we get the address of the command line with the call of the get_cmd_line_ptr function:
    unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
    cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
    return cmd_line_ptr;
which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check that we got cmd_line_ptr, get its virtual address and copy it to boot_command_line, which is just an array of bytes:
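The way get_cmd_line_ptr builds one 64-bit address from two 32-bit fields can be shown with a small C sketch. The function name cmd_line_address and its parameters are ours; they stand for the hdr.cmd_line_ptr and ext_cmd_line_ptr fields mentioned above.

```c
#include <stdint.h>

/* Combine the low and high 32-bit halves of the command line address. */
static inline uint64_t cmd_line_address(uint32_t cmd_line_ptr,
                                        uint32_t ext_cmd_line_ptr)
{
    uint64_t addr = cmd_line_ptr;              /* low 32 bits */
    addr |= (uint64_t)ext_cmd_line_ptr << 32;  /* high 32 bits */
    return addr;
}
```

The cast before the shift matters: without it the shift would be done on a 32-bit value and the high half would be lost.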
    extern char __initdata boot_command_line[];
After this we will have copied the kernel command line and the boot_params structure. In the next step we can see the call of the load_ucode_bsp function, which loads processor microcode, but we will not look at it here.
After the microcode has been loaded we can see the check of console_loglevel and the early_printk function, which prints the Kernel Alive string. But you'll never see this output because early_printk is not initialized yet. It is a minor bug in the kernel and I sent a patch (commit), and you will see it in the mainline soon. So you can skip this code.
Move on init pages
In the next step, as we have copied the boot_params structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for the switchover (you can read about it in the previous part), dropped them all in the reset_early_page_tables function (you can read about it in the previous part too) and kept only the kernel high mapping. After this we call the:
    clear_page(init_level4_pgt);
function and pass init_level4_pgt, which is also defined in arch/x86/kernel/head_64.S and looks like this:
    NEXT_PAGE(init_level4_pgt)
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
        .quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .org    init_level4_pgt + L4_START_KERNEL*8, 0
        .quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss. The clear_page function is defined in arch/x86/lib/clear_page_64.S. Let's look at this function:
    ENTRY(clear_page)
        CFI_STARTPROC
        xorl %eax,%eax
        movl $4096/64,%ecx
        .p2align 4
    .Lloop:
        decl    %ecx
    #define PUT(x) movq %rax,x*8(%rdi)
        movq %rax,(%rdi)
        PUT(1)
        PUT(2)
        PUT(3)
        PUT(4)
        PUT(5)
        PUT(6)
        PUT(7)
        leaq 64(%rdi),%rdi
        jnz .Lloop
        nop
        ret
        CFI_ENDPROC
    .Lclear_page_end:
    ENDPROC(clear_page)
As you can understand from the function name, it clears the page tables, that is, fills them with zeros. First of all note that this function starts with CFI_STARTPROC and ends with CFI_ENDPROC, which expand to GNU assembler directives:
    #define CFI_STARTPROC .cfi_startproc
    #define CFI_ENDPROC .cfi_endproc
and are used for debugging. After the CFI_STARTPROC macro we zero out the eax register (which also clears the upper half of rax) and put 64 into ecx (it will be the counter). Next we can see the loop which starts at the .Lloop label and begins with the decrement of ecx. After that we store zero from the rax register to the address in rdi, which now contains the base address of init_level4_pgt, and do the same seven more times, each time increasing the offset from rdi by 8. After this the first 64 bytes of init_level4_pgt are filled with zeros. In the next step we add 64 to rdi so that it points to the next 64 bytes, and repeat all operations while ecx is not zero. In the end init_level4_pgt is filled with zeros.
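The loop can be restated in C to make the arithmetic visible: 64 iterations, each storing eight 8-byte zeros, cover exactly 4096 bytes. This is only a sketch for illustration; clear_page_sketch and PAGE_BYTES are hypothetical names, not kernel code.

```c
#include <stdint.h>

#define PAGE_BYTES 4096

/* Zero one 4096-byte page: 64 iterations of eight 8-byte stores. */
static inline void clear_page_sketch(void *page)
{
    uint64_t *p = page;
    int count = PAGE_BYTES / 64;        /* plays the role of %ecx */

    do {
        count--;                        /* decl %ecx */
        for (int i = 0; i < 8; i++)     /* movq %rax,(%rdi) and PUT(1)..PUT(7) */
            p[i] = 0;
        p += 8;                         /* leaq 64(%rdi),%rdi */
    } while (count != 0);               /* jnz .Lloop */
}
```

In the assembly version the jnz at the end still tests the flags set by decl at the top of the loop, because movq and leaq do not modify the flags.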
As we have init_level4_pgt filled with zeros, we set the last init_level4_pgt entry to the kernel high mapping with the:
    init_level4_pgt[511] = early_level4_pgt[511];
Remember that we dropped all early_level4_pgt entries in the reset_early_page_tables function and kept only the kernel high mapping there.
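Why entry 511? With 4-level paging, bits 39-47 of a virtual address select the entry in the top-level table. A small sketch, assuming the usual x86_64 value of __START_KERNEL_map (0xffffffff80000000); pgd_index_sketch is a hypothetical helper name:

```c
#include <stdint.h>

/* Assumed value of __START_KERNEL_map on x86_64. */
#define START_KERNEL_MAP_SKETCH 0xffffffff80000000ULL

/* Bits 39..47 of a virtual address index the top-level page table. */
static inline unsigned int pgd_index_sketch(uint64_t vaddr)
{
    return (unsigned int)((vaddr >> 39) & 0x1ff);
}
```

For the kernel text mapping base this gives index 511, the last of the 512 entries, which is the entry copied above.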
Last step before kernel entry point
The last step in the x86_64_start_kernel function is the call of the:
    x86_64_start_reservations(real_mode_data);
function with real_mode_data as its argument. The x86_64_start_reservations function is defined in the same source code file as the x86_64_start_kernel function and looks like this:
    void __init x86_64_start_reservations(char *real_mode_data)
    {
        if (!boot_params.hdr.version)
            copy_bootdata(__va(real_mode_data));
        reserve_ebda_region();
        start_kernel();
    }
You can see that it is the last function before we reach the kernel entry point - the start_kernel function. Let's look at what it does and how it works.
First of all we can see in the x86_64_start_reservations function the check of boot_params.hdr.version:
    if (!boot_params.hdr.version)
        copy_bootdata(__va(real_mode_data));
and if it is zero we call the copy_bootdata function again with the virtual address of real_mode_data (read about its implementation above).
In the next step we can see the call of the reserve_ebda_region function, which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the EBDA, or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.
Let's look at the reserve_ebda_region function. It starts by checking whether paravirtualization is enabled:
    if (paravirt_enabled())
        return;
We exit from the reserve_ebda_region function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of the low memory:
    lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    lowmem <<= 10;
We read the size of the BIOS low memory in kilobytes through its virtual address and convert it to bytes by shifting it left by 10 (multiplying by 1024, in other words). After this we need to get the address of the extended BIOS data area with:
    ebda_addr = get_bios_ebda();
where the get_bios_ebda function is defined in arch/x86/include/asm/bios_ebda.h and looks like this:
    static inline unsigned int get_bios_ebda(void)
    {
        unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
        address <<= 4;
        return address;
    }
Let's try to understand how it works. Here we convert the physical address 0x40E to a virtual one, where 0x0040:0x000e is the segment:offset location which holds the base address of the extended BIOS data area as a real-mode segment value. Don't worry that we use the phys_to_virt function for converting a physical address to a virtual address. You can note that previously we used the __va macro for the same purpose, but phys_to_virt is the same:
    static inline void *phys_to_virt(phys_addr_t address)
    {
        return __va(address);
    }
with only one difference: we pass the argument as phys_addr_t, whose width depends on CONFIG_PHYS_ADDR_T_64BIT:
    #ifdef CONFIG_PHYS_ADDR_T_64BIT
        typedef u64 phys_addr_t;
    #else
        typedef u32 phys_addr_t;
    #endif
The CONFIG_PHYS_ADDR_T_64BIT option is enabled on x86_64, so phys_addr_t is 64 bits wide here. After we have read the 16-bit value stored at this virtual address, which is the real-mode segment of the extended BIOS data area, we shift it left by 4 and return it. After this the ebda_addr variable contains the base address of the extended BIOS data area.
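The shift by 4 is the usual real-mode rule: a segment value names a 16-byte paragraph, so the linear address is the segment multiplied by 16. A minimal sketch (segment_to_linear is a hypothetical name, and the sample segment value in the note below is only an illustration):

```c
#include <stdint.h>

/* Real-mode segment value -> linear address: segment * 16. */
static inline uint32_t segment_to_linear(uint16_t segment)
{
    uint32_t address = segment;

    address <<= 4;
    return address;
}
```

The same rule explains the location being read: segment 0x0040 with offset 0x000e gives 0x400 + 0xe = 0x40E. If the word stored there were, say, 0x9fc0, the EBDA would start at 0x9fc00.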
In the next step we check that the address of the extended BIOS data area and the low memory are not less than the INSANE_CUTOFF macro:
    if (ebda_addr < INSANE_CUTOFF)
        ebda_addr = LOWMEM_CAP;
    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;
which is:
    #define INSANE_CUTOFF 0x20000U
or 128 kilobytes. In the last step we take the lower of the low memory end and the extended BIOS data area address and call the memblock_reserve function, which reserves the memory region for the extended BIOS data between the low memory and the one megabyte mark:
    lowmem = min(lowmem, ebda_addr);
    lowmem = min(lowmem, LOWMEM_CAP);
    memblock_reserve(lowmem, 0x100000 - lowmem);
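The clamping above can be traced with concrete numbers. The sketch below repeats the same steps as a pure function; the LOWMEM_CAP value (0x9f000) is an assumption that is not shown in the text and should be checked against arch/x86/kernel/head.c, and the struct and function names are hypothetical.

```c
#include <stdint.h>

#define INSANE_CUTOFF 0x20000U   /* 128 KB, as defined in the text */
#define LOWMEM_CAP    0x9f000U   /* assumed value, check head.c */
#define ONE_MEGABYTE  0x100000U

struct reserve_range {
    uint32_t base;
    uint32_t size;
};

static inline uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Compute the range that would be passed to memblock_reserve(). */
static inline struct reserve_range ebda_reserve_range(uint32_t lowmem,
                                                      uint32_t ebda_addr)
{
    struct reserve_range r;

    if (ebda_addr < INSANE_CUTOFF)      /* implausibly low: distrust it */
        ebda_addr = LOWMEM_CAP;
    if (lowmem < INSANE_CUTOFF)
        lowmem = LOWMEM_CAP;

    lowmem = min_u32(lowmem, ebda_addr);
    lowmem = min_u32(lowmem, LOWMEM_CAP);

    r.base = lowmem;
    r.size = ONE_MEGABYTE - lowmem;
    return r;
}
```

With lowmem and ebda_addr both equal to 0x9fc00, the cap wins and the reserved range is 0x9f000 with size 0x61000, so everything from the cap up to the one megabyte mark is kept away from the allocator.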
The memblock_reserve function is defined in mm/memblock.c and takes two parameters:
base physical address;
region size.
and reserves a memory region for the given base address and size. memblock_reserve is the first function in this book from the linux kernel memory manager framework. We will take a closer look at the memory manager soon, but for now let's look at its implementation.
First touch of the linux kernel memory manager framework
In the previous paragraph we stopped at the call of the memblock_reserve function and, as I said before, it is the first function from the memory manager framework. Let's try to understand how it works. The memblock_reserve function just calls the:
    memblock_reserve_region(base, size, MAX_NUMNODES, 0);
function and passes 4 parameters there:
physical base address of the memory region;
size of the memory region;
maximum number of NUMA nodes;
flags.
At the start of the memblock_reserve_region body we can see the definition of a pointer to the memblock_type structure:
    struct memblock_type *_rgn = &memblock.reserved;
which represents the type of the memory block and looks like this:
    struct memblock_type {
        unsigned long cnt;
        unsigned long max;
        phys_addr_t total_size;
        struct memblock_region *regions;
    };
As we need to reserve a memory block for the extended BIOS data area, the type of the current memory region is reserved, where the memblock structure is:
    struct memblock {
        bool bottom_up;
        phys_addr_t current_limit;
        struct memblock_type memory;
        struct memblock_type reserved;
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        struct memblock_type physmem;
    #endif
    };
and describes a generic memory block. You can see that we initialize _rgn by assigning it the address of memblock.reserved. memblock is the global variable which looks like this:
    struct memblock memblock __initdata_memblock = {
        .memory.regions     = memblock_memory_init_regions,
        .memory.cnt         = 1,
        .memory.max         = INIT_MEMBLOCK_REGIONS,
        .reserved.regions   = memblock_reserved_init_regions,
        .reserved.cnt       = 1,
        .reserved.max       = INIT_MEMBLOCK_REGIONS,
    #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
        .physmem.regions    = memblock_physmem_init_regions,
        .physmem.cnt        = 1,
        .physmem.max        = INIT_PHYSMEM_REGIONS,
    #endif
        .bottom_up          = false,
        .current_limit      = MEMBLOCK_ALLOC_ANYWHERE,
    };
We will not dive into the details of this variable now, but we will see all the details about it in the parts about the memory manager. Just note that the memblock variable is defined with __initdata_memblock, which is:
    #define __initdata_memblock __meminitdata
and __meminitdata is:
    #define __meminitdata __section(.meminit.data)
From this we can conclude that all memory blocks will be in the .meminit.data section. After we have defined _rgn we print information about it with the memblock_dbg macro. You can enable it by passing memblock=debug on the kernel command line.
After the debugging lines are printed, next comes the call of the following function:
    memblock_add_range(_rgn, base, size, nid, flags);
which adds a new memory block region into the .meminit.data section. As we did not initialize _rgn and it just contains &memblock.reserved, we just fill the passed _rgn with the base address of the extended BIOS data area region, the size of this region and the flags:
    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 1 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        return 0;
    }
After we filled our region we can see the call of the memblock_set_region_node function with two parameters:
address of the filled memory region;
NUMA node id.
where our regions are represented by the memblock_region structure:
    struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;
    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
    #endif
    };
The NUMA node id depends on the MAX_NUMNODES macro, which is defined in include/linux/numa.h:
    #define MAX_NUMNODES (1 << NODES_SHIFT)
where NODES_SHIFT depends on the CONFIG_NODES_SHIFT configuration parameter and is defined as:
    #ifdef CONFIG_NODES_SHIFT
    #define NODES_SHIFT CONFIG_NODES_SHIFT
    #else
    #define NODES_SHIFT 0
    #endif
The memblock_set_region_node function just fills the nid field of memblock_region with the given value:
    static inline void memblock_set_region_node(struct memblock_region *r, int nid)
    {
        r->nid = nid;
    }
After this we will have the first reserved memblock for the extended BIOS data area in the .meminit.data section. The reserve_ebda_region function has finished its work at this step and we can go back to arch/x86/kernel/head64.c.
We have finished all preparations before the kernel entry point! The last step in the x86_64_start_reservations function is the call of the:
    start_kernel()
function from the init/main.c file.
That's all for this part.
Conclusion
It is the end of the third part about linux kernel internals. In the next part we will see the first initialization steps in the kernel entry point - the start_kernel function. It will be the first step before we see the launch of the first init process.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
BIOS data area
What is in the extended BIOS data area on a PC?
Previous part
Kernel initialization. Part 4.
Kernel entry point
If you have read the previous part - Last preparations before the kernel entry point, you can remember that we finished all pre-initialization stuff and stopped right before the call to the start_kernel function from init/main.c. start_kernel is the entry point of the generic and architecture-independent kernel code, although we will return to the arch/ folder many times. If you look inside the start_kernel function, you will see that this function is very big. At this moment it contains about 86 function calls. Yes, it's very big and of course this part will not cover all the processes that occur in this function. In the current part we will only start on it. This part and all the following ones in the Kernel initialization process chapter will cover it.
The main purpose of start_kernel is to finish the kernel initialization process and launch the first init process. Before the first process is started, start_kernel must do many things, such as: enable the lock validator, initialize the processor id, enable the early cgroups subsystem, set up per-cpu areas, initialize different caches in vfs, initialize the memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first init process in the last part of this chapter. So much kernel code awaits us, let's start.
NOTE: All parts of this big chapter, Linux Kernel initialization process, will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.
A little about function attributes
As I wrote above, the start_kernel function is defined in init/main.c. This function is defined with the __init attribute and, as you may already know from other parts, all functions defined with this attribute are necessary only during kernel initialization.
    #define __init      __section(.init.text) __cold notrace
After the initialization process is finished, the kernel releases these sections with a call to the free_initmem function. Note also that __init is defined with two attributes: __cold and notrace. The purpose of the first, cold, attribute is to mark that the function is rarely used and that the compiler must optimize this function for size. The second, notrace, is defined as:
    #define notrace __attribute__((no_instrument_function))
where no_instrument_function tells the compiler not to generate profiling function calls.
In the definition of the start_kernel function, you can also see the __visible attribute, which expands to:
    #define __visible __attribute__((externally_visible))
where externally_visible tells the compiler that something uses this function or variable, to prevent marking this function/variable as unused. You can find the definition of this and other macro attributes in include/linux/init.h.
First steps in the start_kernel
At the beginning of start_kernel you can see the definition of these two variables:
    char *command_line;
    char *after_dashes;
The first represents a pointer to the kernel command line and the second will contain the result of the parse_args function, which parses an input string with parameters in the form name=value, looking for specific keywords and invoking the right handlers. We will not go into the details related to these two variables at this time, but will see them in the next parts. In the next step we can see a call to the:
    lockdep_init();
function. lockdep_init initializes the lock validator. Its implementation is pretty simple: it just initializes two list_head hashes and sets the lockdep_initialized global variable to 1. The lock validator detects circular lock dependencies and is called when any spinlock or mutex is acquired.
The next function is set_task_stack_end_magic, which takes the address of init_task and sets STACK_END_MAGIC (0x57AC6E9D) as a canary for it. init_task represents the initial task structure:
    struct task_struct init_task = INIT_TASK(init_task);
where task_struct stores all the information about a process. I will not explain this structure in this book because it's very big. You can find its definition in include/linux/sched.h. At this moment task_struct contains more than 100 fields! Although you will not see the explanation of task_struct in this book, we will use it very often since it is the fundamental structure which describes a process in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.
You can see the definition of init_task and that it is initialized by the INIT_TASK macro. This macro is from include/linux/init_task.h and it just fills init_task with the values for the first process. For example it sets:
init process state to zero or runnable. A runnable process is one which is waiting only for a CPU to run on;
init process flags - PF_KTHREAD which means - kernel thread;
a list of runnable tasks;
process address space;
init process stack to &init_thread_info, which is init_thread_union.thread_info, where init_thread_union has the type thread_union, which contains thread_info and the process stack:
    union thread_union {
        struct thread_info thread_info;
        unsigned long stack[THREAD_SIZE/sizeof(long)];
    };
Every process has its own stack and it is 16 kilobytes or 4 page frames in x86_64. We can note that it is defined as an array of unsigned long. The next field of thread_union is thread_info, defined as:
    struct thread_info {
        struct task_struct      *task;
        struct exec_domain      *exec_domain;
        __u32                   flags;
        __u32                   status;
        __u32                   cpu;
        int                     saved_preempt_count;
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
        unsigned int            sig_on_uaccess_error:1;
        unsigned int            uaccess_err:1;
    };
and occupies 52 bytes. The thread_info structure contains architecture-specific information on the thread. We know that on x86_64 the stack grows down and thread_union.thread_info is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and thread_info is at the bottom. The remaining thread_size will be 16 kilobytes - 52 bytes = 16332 bytes. Note that thread_union is represented as a union and not a structure, which means that thread_info and the stack share the memory space.
Schematically it can be represented as follows:
    +-----------------------+
    |                       |
    |                       |
    |        stack          |
    |                       |
    |_______________________|
    |          |            |
    |          |            |
    |          |            |
    |__________↓____________|             +--------------------+
    |                       |             |                    |
    |      thread_info      |<----------->|     task_struct    |
    |                       |             |                    |
    +-----------------------+             +--------------------+
http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct
So the INIT_TASK macro fills these task_struct fields and many many more. As I already wrote above, I will not describe all the fields and values in the INIT_TASK macro, but we will see them soon.
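The sharing of memory between thread_info and the stack can be checked with a scaled-down model. In this sketch thread_info_sketch is a hypothetical, much smaller stand-in for the real structure, so the byte counts differ from the 52 bytes quoted above; only the layout principle is the same.

```c
#include <stddef.h>

#define THREAD_SIZE_SKETCH (16 * 1024)

/* Hypothetical, reduced stand-in for struct thread_info. */
struct thread_info_sketch {
    void         *task;
    unsigned int  flags;
    unsigned int  cpu;
};

/* Both members start at offset 0, so they overlap in memory. */
union thread_union_sketch {
    struct thread_info_sketch thread_info;
    unsigned long stack[THREAD_SIZE_SKETCH / sizeof(long)];
};

/* Bytes left for the stack above thread_info. */
static inline size_t usable_stack_bytes(void)
{
    return sizeof(union thread_union_sketch)
         - sizeof(struct thread_info_sketch);
}
```

The union is exactly as large as its biggest member, the 16 kilobyte array, and thread_info simply occupies the lowest addresses of that array, which is the part a downward-growing stack reaches last.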
Now let's go back to the set_task_stack_end_magic function. This function is defined in kernel/fork.c and sets a canary on the init process stack to detect stack overflow.
    void set_task_stack_end_magic(struct task_struct *tsk)
    {
        unsigned long *stackend;
        stackend = end_of_stack(tsk);
        *stackend = STACK_END_MAGIC; /* for overflow detection */
    }
Its implementation is simple. set_task_stack_end_magic gets the end of the stack for the given task_struct with the end_of_stack function. The end of a process stack depends on the CONFIG_STACK_GROWSUP configuration option. As we learned, in the x86_64 architecture the stack grows down. So the end of the process stack will be:
    (unsigned long *)(task_thread_info(p) + 1);
where task_thread_info just returns the stack which we filled with the INIT_TASK macro:
    #define task_thread_info(task) ((struct thread_info *)(task)->stack)
As we got the end of the init process stack, we write STACK_END_MAGIC there. After the canary is set, we can check it like this:
    if (*end_of_stack(task) != STACK_END_MAGIC) {
        //
        // handle stack overflow here
        //
    }
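A toy model shows why the check works: the canary sits at the lowest word of the stack area, so a stack that grows down too far overwrites it before anything else. Everything here except the STACK_END_MAGIC value is a hypothetical simplification (no thread_info, a tiny fixed-size stack).

```c
#include <stdbool.h>

#define STACK_END_MAGIC 0x57AC6E9DUL
#define TOY_STACK_WORDS 64

/* words[TOY_STACK_WORDS - 1] is the top; the stack grows towards words[0]. */
struct toy_stack {
    unsigned long words[TOY_STACK_WORDS];
};

/* Place the canary at the far end of the stack area. */
static inline void toy_set_stack_end_magic(struct toy_stack *s)
{
    s->words[0] = STACK_END_MAGIC;
}

/* An overflow is detected after the fact, once the canary has been clobbered. */
static inline bool toy_stack_overflowed(const struct toy_stack *s)
{
    return s->words[0] != STACK_END_MAGIC;
}
```

Note that this kind of canary does not prevent the overflow, it only makes it visible the next time somebody checks.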
The next function after set_task_stack_end_magic is smp_setup_processor_id. This function has an empty body for x86_64:
    void __init __weak smp_setup_processor_id(void)
    {
    }
as it is not implemented for all architectures, only for some, such as s390 and arm64.
The next function in start_kernel is debug_objects_early_init. The implementation of this function is almost the same as lockdep_init, but it fills hashes for object debugging. As I wrote above, we will not see the explanation of this and other functions which are for debugging purposes in this chapter.
After the debug_objects_early_init function we can see the call of the boot_init_stack_canary function, which fills task_struct->stack_canary with the canary value for the -fstack-protector gcc feature. This function depends on the CONFIG_CC_STACKPROTECTOR configuration option; if this option is disabled, boot_init_stack_canary does nothing, otherwise it generates a random number based on the random pool and the TSC:
    get_random_bytes(&canary, sizeof(canary));
    tsc = __native_read_tsc();
    canary += tsc + (tsc << 32UL);
After we got a random number, we fill the stack_canary field of task_struct with it:
    current->stack_canary = canary;
and write this value to the top of the IRQ stack with the:
    this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write
Again, we will not dive into details here; we will cover it in the part about IRQs. As the canary is set, we disable local and early boot IRQs and register the bootstrap CPU in the CPU maps. We disable local IRQs (interrupts for the current CPU) with the local_irq_disable macro, which expands to the call of the arch_local_irq_disable function from arch/x86/include/asm/irqflags.h:
    static inline notrace void arch_local_irq_disable(void)
    {
        native_irq_disable();
    }
where native_irq_disable is the cli instruction for x86_64. As interrupts are disabled, we can register the current CPU with the given ID in the CPU bitmap.
The first processor activation
The current function from start_kernel is boot_cpu_init. This function initializes various CPU masks for the bootstrap processor. First of all it gets the bootstrap processor id with a call to:
    int cpu = smp_processor_id();
For now it is just zero. If the CONFIG_DEBUG_PREEMPT configuration option is disabled, smp_processor_id just expands to the call of raw_smp_processor_id, which expands to:
    #define raw_smp_processor_id() (this_cpu_read(cpu_number))
this_cpu_read, like many other functions of this kind (this_cpu_write, this_cpu_add and so on), is defined in include/linux/percpu-defs.h and represents a this_cpu operation. These operations provide a way of optimizing access to the per-cpu variables which are associated with the current processor. In our case it is this_cpu_read:
    __pcpu_size_call_return(this_cpu_read_, pcp)
Remember that we have passed cpu_number as pcp to this_cpu_read from raw_smp_processor_id. Now let's look at the __pcpu_size_call_return implementation:
    #define __pcpu_size_call_return(stem, variable)                     \
    ({                                                                  \
        typeof(variable) pscr_ret__;                                    \
        __verify_pcpu_ptr(&(variable));                                 \
        switch(sizeof(variable)) {                                      \
        case 1: pscr_ret__ = stem##1(variable); break;                  \
        case 2: pscr_ret__ = stem##2(variable); break;                  \
        case 4: pscr_ret__ = stem##4(variable); break;                  \
        case 8: pscr_ret__ = stem##8(variable); break;                  \
        default:                                                        \
            __bad_size_call_parameter(); break;                         \
        }                                                               \
        pscr_ret__;                                                     \
    })
Yes, it looks a little strange but it's easy. First of all we can see the definition of the pscr_ret__ variable with the int type. Why int? Ok, variable is cpu_number and it was declared as a per-cpu int variable:
    DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
In the next step we call __verify_pcpu_ptr with the address of cpu_number. __verify_pcpu_ptr is used to verify that the given parameter is a per-cpu pointer. After that we set the pscr_ret__ value, which depends on the size of the variable. Our cpu_number variable is int, so it is 4 bytes in size. It means that we will get this_cpu_read_4(cpu_number) in pscr_ret__. At the end of __pcpu_size_call_return we just return it. this_cpu_read_4 is a macro:
    #define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)
which calls percpu_from_op and passes the mov instruction and the per-cpu variable there. percpu_from_op will expand to the inline assembly call:
    asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))
Let's try to understand how it works and what it does. The gs segment register contains the base of the per-cpu area. Here we just copy cpu_number, which is in memory, to pfo_ret__ with the movl instruction. Or in other words:
    this_cpu_read(cpu_number)
is the same as:
    movl %gs:$cpu_number, $pfo_ret__
As we didn't set up the per-cpu area, we have only one - for the current running CPU - and we will get zero as a result of smp_processor_id.
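The size-based dispatch with token pasting is easier to follow in a user-space miniature. In this sketch the demo_read_N helpers are hypothetical stand-ins for this_cpu_read_N, and a chain of conditional operators replaces the kernel's switch inside a statement expression so that the example stays within standard C.

```c
#include <stdint.h>
#include <string.h>

/* One hypothetical reader per operand size, standing in for this_cpu_read_N. */
static inline uint64_t demo_read_1(const void *p) { uint8_t v;  memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_2(const void *p) { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_4(const void *p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
static inline uint64_t demo_read_8(const void *p) { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }

/*
 * stem##N pastes the size suffix onto the stem, as in the kernel macro.
 * Only the branch that matches sizeof(variable) is ever evaluated.
 */
#define DEMO_SIZE_CALL_RETURN(stem, variable)               \
    (sizeof(variable) == 1 ? stem##1(&(variable)) :         \
     sizeof(variable) == 2 ? stem##2(&(variable)) :         \
     sizeof(variable) == 4 ? stem##4(&(variable)) :         \
                             stem##8(&(variable)))
```

For a 4-byte int variable, DEMO_SIZE_CALL_RETURN(demo_read_, variable) ends up calling demo_read_4, in the same way that this_cpu_read on the int cpu_number ends up as this_cpu_read_4.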
As we got the current processor id, boot_cpu_init sets the given CPU online, active, present and possible with the:
    set_cpu_online(cpu, true);
    set_cpu_active(cpu, true);
    set_cpu_present(cpu, true);
    set_cpu_possible(cpu, true);
All of these functions use the concept of a cpumask. cpu_possible is a set of CPU IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option; if this option is disabled, possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise cpumask_clear_cpu.
For example let's look at set_cpu_possible. As we passed true as the second parameter, the:
    cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
will be called. First of all let's try to understand the to_cpumask macro. This macro casts a bitmap to a struct cpumask *. CPU masks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the cpumask structure:
    typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
which is just a bitmap declared with the DECLARE_BITMAP macro:
    #define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]
As we can see from its definition, the DECLARE_BITMAP macro expands to an array of unsigned long. Now let's look at how the to_cpumask macro is implemented:
    #define to_cpumask(bitmap)                                          \
        ((struct cpumask *)(1 ? (bitmap)                                \
                        : (void *)sizeof(__check_is_bitmap(bitmap))))
I don't know about you, but it looked really weird to me the first time. We can see a ternary operator here whose condition is true every time, so why is __check_is_bitmap here? It's simple, let's look at it:
    static inline int __check_is_bitmap(const unsigned long *bitmap)
    {
        return 1;
    }
Yeah, it just returns 1 every time. Actually we need it here only for one purpose: at compile time it checks that the given bitmap is a bitmap, or in other words it checks that the given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro for converting the array of unsigned long to a struct cpumask *. Now we can call the cpumask_set_cpu function with the cpu - 0 and the struct cpumask * for cpu_possible_bits. This function makes only one call of the set_bit function, which sets the given cpu in the cpumask. All of these set_cpu_* functions work on the same principle.
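The whole chain, from bitmap declaration through the type-checking cast to setting a bit, fits into a small user-space sketch. All demo_ names and the CPU count of 130 are hypothetical, and the bit operation here is a plain non-atomic OR, whereas the kernel's set_bit is atomic.

```c
#include <limits.h>

#define DEMO_BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) (((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)
#define DEMO_DECLARE_BITMAP(name, bits) unsigned long name[DEMO_BITS_TO_LONGS(bits)]
#define DEMO_NR_CPUS 130      /* arbitrary illustrative CPU count */

struct demo_cpumask { DEMO_DECLARE_BITMAP(bits, DEMO_NR_CPUS); };

/* Exists only so that sizeof() below type-checks its argument. */
static inline int demo_check_is_bitmap(const unsigned long *bitmap)
{
    (void)bitmap;
    return 1;
}

/* The condition is always 1; the other arm is never evaluated, only type-checked. */
#define demo_to_cpumask(bitmap)                             \
    ((struct demo_cpumask *)(1 ? (bitmap)                   \
        : (void *)sizeof(demo_check_is_bitmap(bitmap))))

/* Non-atomic stand-in for set_bit(). */
static inline void demo_cpumask_set_cpu(unsigned int cpu, struct demo_cpumask *mask)
{
    mask->bits[cpu / DEMO_BITS_PER_LONG] |= 1UL << (cpu % DEMO_BITS_PER_LONG);
}

static inline int demo_cpumask_test_cpu(unsigned int cpu, const struct demo_cpumask *mask)
{
    return (int)((mask->bits[cpu / DEMO_BITS_PER_LONG] >> (cpu % DEMO_BITS_PER_LONG)) & 1UL);
}
```

Passing something that is not an unsigned long pointer, for example an array of int, to demo_to_cpumask makes the compiler complain about the argument of demo_check_is_bitmap, even though that call is never executed because it is only an operand of sizeof.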
If these set_cpu_* operations and cpumask are not clear to you, don't worry about it. You can get more info by reading the special part about it - cpumask - or the documentation.
As we activated the bootstrap processor, it's time to go to the next function in start_kernel. Now it is page_address_init, but this function does nothing in our case, because it does work only when not all RAM can be mapped directly.
Print linux banner
The next call is pr_notice:
    #define pr_notice(fmt, ...) \
        printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
as you can see it just expands to the printk call. At this moment we use pr_notice to print the Linux banner:
    pr_notice("%s", linux_banner);
which is just the kernel version with some additional parameters:
    Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6)) #319 SMP
Architecture-dependent parts of initialization
The next step is architecture-specific initialization. The Linux kernel does it with the call of the setup_arch function. This is a very big function, like start_kernel, and we do not have time to consider all of its implementation in this part. Here we'll only start on it and continue in the next part. As it is architecture-specific, we need to go to the arch/ directory again. The setup_arch function is defined in the arch/x86/kernel/setup.c source code file and takes only one argument - the address of the kernel command line.
This function starts by reserving a memory block for the kernel _text and _data, which starts from the _text symbol (you can remember it from arch/x86/kernel/head_64.S) and ends before __bss_stop. We use memblock for reserving the memory block:
    memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
You can read about memblock in the Linux kernel memory management Part 1. As you can remember, the memblock_reserve function takes two parameters:
base physical address of a memory block;
size of a memory block.
We can get the base physical address of the _text symbol with the __pa_symbol macro:
    #define __pa_symbol(x) \
        __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
First of all it calls the __phys_reloc_hide macro on the given parameter. The __phys_reloc_hide macro does nothing for x86_64 and just returns the given parameter. The implementation of the __phys_addr_symbol macro is easy. It just subtracts the base virtual address of the kernel text mapping (you can remember that it is __START_KERNEL_map) from the symbol address and adds phys_base, which is the physical base offset of the kernel image:
    #define __phys_addr_symbol(x) \
        ((unsigned long)(x) - __START_KERNEL_map + phys_base)
After we got the physical address of the _text symbol, memblock_reserve can reserve a memory block which starts at _text and has a size of __bss_stop - _text.
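The translation is plain arithmetic and can be tried with sample numbers. The sketch assumes the usual x86_64 value of __START_KERNEL_map (0xffffffff80000000); the function name and the sample addresses in the note are hypothetical.

```c
#include <stdint.h>

/* Assumed value of __START_KERNEL_map on x86_64. */
#define START_KERNEL_MAP_SKETCH 0xffffffff80000000ULL

/* Kernel-text virtual address -> physical address. */
static inline uint64_t phys_addr_symbol_sketch(uint64_t vaddr, uint64_t phys_base)
{
    return vaddr - START_KERNEL_MAP_SKETCH + phys_base;
}
```

A symbol at virtual address 0xffffffff81000000 maps to physical 0x1000000 when phys_base is zero, and to 0x1200000 when phys_base is 0x200000.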
Reserve memory for initrd
The next step, after we reserved the place for the kernel text and data, is reserving the place for the initrd. We will not see details about initrd in this post; you only need to know that it is a temporary root file system stored in memory and used by the kernel during its startup. The early_reserve_initrd function does all the work. First of all this function gets the base address of the ramdisk, its size and the end address with:
    u64 ramdisk_image = get_ramdisk_image();
    u64 ramdisk_size  = get_ramdisk_size();
    u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
All of these parameters are taken from boot_params. If you have read the chapter about the Linux Kernel Booting Process, you must remember that we filled the boot_params structure during boot time. The kernel setup header contains a couple of fields which describe the ramdisk, for example:
    Field name:     ramdisk_image
    Type:           write (obligatory)
    Offset/size:    0x218/4
    Protocol:       2.00+

      The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
      zero if there is no initial ramdisk/ramfs.
So we can get all the information that interests us from boot_params. For example let's look at get_ramdisk_image:
    static u64 __init get_ramdisk_image(void)
    {
        u64 ramdisk_image = boot_params.hdr.ramdisk_image;
        ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
        return ramdisk_image;
    }
Here we take the low 32 bits of the ramdisk address from boot_params.hdr.ramdisk_image, then shift ext_ramdisk_image left by 32 and OR it in. We need to do this because, as you can read in Documentation/x86/zero-page.txt:
    0C0/004 ALL     ext_ramdisk_image ramdisk_image high 32bits
So after shifting it left by 32 and combining the two halves, we get a 64-bit address in ramdisk_image and return it. get_ramdisk_size works on the same principle as get_ramdisk_image, but it uses ext_ramdisk_size instead of ext_ramdisk_image. After we got the ramdisk's size, base address and end address, we check that the bootloader provided a ramdisk with the:
    if (!boot_params.hdr.type_of_loader ||
        !ramdisk_image || !ramdisk_size)
        return;
and reserve a memory block with the calculated addresses for the initial ramdisk in the end:
    memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
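Both calculations, joining the two 32-bit halves and rounding the end up to a page boundary, can be checked in isolation. The sketch assumes 4096-byte pages; all names and the sample addresses are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096ULL
/* Round x up to the next multiple of the page size. */
#define PAGE_ALIGN_SKETCH(x) \
    (((x) + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1))

/* Build a 64-bit value from the header field and its ext_ counterpart. */
static inline uint64_t join_halves(uint32_t low, uint32_t high)
{
    return (uint64_t)low | ((uint64_t)high << 32);
}

/* Size handed to memblock_reserve(): page-aligned end minus the start. */
static inline uint64_t ramdisk_reserve_size(uint64_t image, uint64_t size)
{
    uint64_t end = PAGE_ALIGN_SKETCH(image + size);

    return end - image;
}
```

A ramdisk of 0x1234 bytes loaded at 0x7f000000 ends at 0x7f001234; after alignment the end becomes 0x7f002000, so two full pages (0x2000 bytes) are reserved.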
Conclusion
It is the end of the fourth part about the Linux kernel initialization process. We started to dive into the generic kernel code from the start_kernel function in this part and stopped at the architecture-specific initialization in setup_arch. In the next part we will continue with architecture-dependent initialization steps.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
GCC function attributes
this_cpu operations
cpumask
lock validator
cgroups
stack buffer overflow
IRQs
initrd
Previous part
Kernel initialization. Part 5.
Continue of architecture-specific initializations
In the previous part, we stopped at the initialization of architecture-specific stuff in the setup_arch function, and now we will continue with it. As we reserved memory for the initrd, the next step is olpc_ofw_detect, which detects One Laptop Per Child support. We will not consider platform-related stuff in this book and will skip functions related to it. So let's go ahead. The next step is the early_trap_init function. This function initializes the debug (#DB - raised when the TF flag of rflags is set) and int3 (#BP) interrupt gates. If you don't know anything about interrupts, you can read about them in Early interrupt and exception handling. In the x86 architecture INT, INTO and INT3 are special instructions which allow a task to explicitly call an interrupt handler. The INT3 instruction calls the breakpoint (#BP) handler. You can remember, we already saw it in the part about interrupts and exceptions:
    ----------------------------------------------------------------------------------------------
    |Vector|Mnemonic|Description          |Type |Error Code|Source                                |
    ----------------------------------------------------------------------------------------------
    |3     | #BP    |Breakpoint           |Trap |NO        |INT 3                                 |
    ----------------------------------------------------------------------------------------------
The debug interrupt #DB is the primary means of invoking debuggers. early_trap_init is defined in arch/x86/kernel/traps.c. This function sets the #DB and #BP handlers and reloads the IDT:
    void __init early_trap_init(void)
    {
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
        load_idt(&idt_descr);
    }
We already saw the implementation of set_intr_gate in the previous part about interrupts. Here are two similar functions, set_intr_gate_ist and set_system_intr_gate_ist. Both of these functions take three parameters:
number of the interrupt;
base address of the interrupt/exception handler;
third parameter is - Interrupt Stack Table. IST is a new mechanism in x86_64 and part of the TSS. Every active thread in kernel mode has its own kernel stack, which is 16 kilobytes. While a thread is in user space, its kernel stack is empty except for thread_info (read about it in the previous part) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. You can read all about these stacks in the linux kernel documentation - Kernel stacks. x86_64 provides a feature which allows switching to a new special stack during events such as a non-maskable interrupt and so on. The name of this feature is Interrupt Stack Table. There can be up to 7 IST entries per CPU and every entry points to a dedicated stack. In our case this is DEBUG_STACK.
set_intr_gate_ist and set_system_intr_gate_ist work on the same principle as set_intr_gate, with only one difference. Both of these functions check the interrupt number and call _set_gate inside:
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
as set_intr_gate does. But set_intr_gate calls _set_gate with dpl 0 and ist 0, while set_intr_gate_ist and set_system_intr_gate_ist set ist to DEBUG_STACK, and set_system_intr_gate_ist sets dpl to 0x3, which is the lowest privilege. When an interrupt occurs and the hardware loads such a descriptor, the hardware automatically sets the new stack pointer based on the IST value and then invokes the interrupt handler. All of the special kernel stacks will be set up in the cpu_init function (we will see it later).
As the #DB and #BP gates are written to idt_descr, we reload the IDT table with load_idt, which just calls the lidt instruction. Now let's look at the interrupt handlers and try to understand how they work. Of course, I can't cover all interrupt handlers in this book and I do not see the point in doing so. It is very interesting to delve into the linux kernel source code, so we will see how the debug handler is implemented in this part, and understanding how the other interrupt handlers are implemented will be your task.
#DB handler
As you can read above, we passed the address of the #DB handler as &debug to set_intr_gate_ist. lxr.free-electrons.com is a great resource for searching identifiers in the linux kernel source code, but unfortunately you will not find the debug handler with it. All you can find is the debug declaration in arch/x86/include/asm/traps.h:
    asmlinkage void debug(void);
We can see the asmlinkage attribute, which tells us that debug is a function written in assembly. Yeah, again and again assembly :). The implementation of the #DB handler, like other handlers, is in arch/x86/kernel/entry_64.S and is defined with the idtentry assembly macro:
    idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry is a macro which defines an interrupt/exception entry point. As you can see it takes five arguments:
name of the interrupt entry point;
name of the interrupt handler;
has interrupt error code or not;
paranoid - if this parameter = 1, switch to special stack (read above);
shift_ist - stack to switch during interrupt.
Now let's look at the idtentry macro implementation. This macro is defined in the same assembly file and defines the debug function with the ENTRY macro. First, the idtentry macro checks that the given parameters are correct in case a switch to the special stack is needed. In the next step it checks whether the given interrupt returns an error code. If the interrupt does not return an error code (in our case #DB does not return an error code), it calls INTR_FRAME, or XCPT_FRAME if the interrupt has an error code. Both of these macros, XCPT_FRAME and INTR_FRAME, generate no code and are needed only for building the initial frame state for interrupts. They use CFI directives and are used for debugging. You can find more info in the CFI directives documentation. As a comment from arch/x86/kernel/entry_64.S says: CFI macros are used to generate dwarf2 unwind information for better backtraces. They don't change any code. So we will ignore them.
    .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
    ENTRY(\sym)
        /* Sanity check */
        .if \shift_ist != -1 && \paranoid == 0
        .error "using shift_ist requires paranoid=1"
        .endif
        .if \has_error_code
        XCPT_FRAME
        .else
        INTR_FRAME
        .endif
        ...
        ...
        ...
You may remember from the previous part about early interrupts/exceptions handling that after an interrupt occurs, the current stack will have the following format:
        +-----------------------+
        |                       |
    +40 |          SS           |
    +32 |          RSP          |
    +24 |         RFLAGS        |
    +16 |          CS           |
    +8  |          RIP          |
     0  |       Error Code      | <---- rsp
        |                       |
        +-----------------------+
The next two macros from the idtentry implementation are:
    ASM_CLAC
    PARAVIRT_ADJUST_EXCEPTION_FRAME
The first, ASM_CLAC, depends on the CONFIG_X86_SMAP configuration option and is needed for security reasons; you can read more about it under Supervisor mode access prevention in the links. The second, PARAVIRT_ADJUST_EXCEPTION_FRAME, is for handling Xen-type exceptions (this chapter is about kernel initialization and we will not consider virtualization stuff here).
The next piece of code checks whether the interrupt has an error code and, if not, pushes $-1 (which is 0xffffffffffffffff on x86_64) on the stack:
    .ifeq \has_error_code
    pushq_cfi $-1
    .endif
We need to do this as a dummy error code, for stack consistency across all interrupts. In the next step we subtract $ORIG_RAX-R15 from the stack pointer:
    subq $ORIG_RAX-R15, %rsp
where ORIG_RAX, R15 and the other macros are defined in arch/x86/include/asm/calling.h, and ORIG_RAX-R15 is 120 bytes. The general purpose registers will occupy these 120 bytes (15 registers of 8 bytes each), because we need to store all registers on the stack during interrupt handling. After we set up the stack for the general purpose registers, the next step is checking whether the interrupt came from userspace with:
    testl $3, CS(%rsp)
    jnz 1f
Here we check the first and second bits in CS. You may remember that the CS register contains a segment selector whose first two bits are the RPL. All privilege levels are integers in the range 0 to 3, where the lowest number corresponds to the highest privilege. So if the interrupt came from kernel mode we call save_paranoid, and if not we jump to label 1. In save_paranoid we store all general purpose registers on the stack and switch the user gs to the kernel gs if needed:
        movl $1,%ebx
        movl $MSR_GS_BASE,%ecx
        rdmsr
        testl %edx,%edx
        js 1f
        SWAPGS
        xorl %ebx,%ebx
    1:  ret
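To make these two checks a bit more concrete, here is a small user-space sketch in C. It is not kernel code: came_from_user_mode, needs_swapgs and privilege_check_examples are hypothetical names, and the selector values and addresses in the examples are assumptions chosen only for illustration. The first helper mirrors the test of the RPL bits in the saved CS, the second mirrors the sign test that save_paranoid performs on the upper half of the value read from MSR_GS_BASE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two low bits of a segment selector hold the requested privilege level. */
#define RPL_MASK 0x3u

/* Mirrors `testl $3, CS(%rsp); jnz 1f`: a non-zero RPL means user mode. */
static bool came_from_user_mode(uint16_t saved_cs)
{
    return (saved_cs & RPL_MASK) != 0;
}

/*
 * Mirrors the check in save_paranoid: rdmsr leaves the upper half of the
 * GS base in edx, and `js 1f` skips SWAPGS when that half is negative,
 * i.e. when bit 63 is set and the base already looks like a kernel address.
 */
static bool needs_swapgs(uint64_t gs_base)
{
    return (gs_base >> 63) == 0;
}

/* Example values only; the selectors and addresses are assumptions. */
void privilege_check_examples(void)
{
    assert(!came_from_user_mode(0x10));            /* RPL 0 */
    assert(came_from_user_mode(0x33));             /* RPL 3 */
    assert(!needs_swapgs(0xffff880000000000ULL));  /* kernel-like base */
    assert(needs_swapgs(0x00007f0000001000ULL));   /* user-like base */
}
```

Calling privilege_check_examples() from a test program exercises both helpers; the point is only that a single mask and a single sign bit decide which path the entry code takes.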
In the next steps we put the pt_regs pointer into rdi, save the error code in rsi if there is one, and call the interrupt handler, which in our case is do_debug from arch/x86/kernel/traps.c. do_debug, like other handlers, takes two parameters:
pt_regs - a structure which presents the set of CPU registers which are saved in the process' memory region;
error code - error code of the interrupt.
After the interrupt handler finishes its work, paranoid_exit is called, which restores the stack, switches to userspace if the interrupt came from there, and calls iret. That's all. Of course it is not all :), but we will look at it more deeply in the separate chapter about interrupts.
This is the general view of the idtentry macro for the #DB interrupt. All interrupts are similar to this implementation and are defined with idtentry too. After early_trap_init finishes its work, the next function is early_cpu_init. This function is defined in arch/x86/kernel/cpu/common.c and collects information about the CPU and its vendor.
Early ioremap initialization
The next step is initialization of early ioremap. In general there are two ways to communicate with devices:
I/O Ports;
Device memory.
We already saw the first method (outb/inb instructions) in the part about the linux kernel booting process. The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM or to memory of an I/O device that is mapped into the physical address space. So ioremap is used to map device memory into kernel address space.
As I wrote above, the next function is early_ioremap_init, which re-maps I/O memory to kernel address space so the kernel can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like ioremap are available. The implementation of this function is in arch/x86/mm/ioremap.c. At the start of early_ioremap_init we can see the definition of the pmd pointer with pmd_t type (which presents a page middle directory entry: typedef struct { pmdval_t pmd; } pmd_t; where pmdval_t is unsigned long) and a check that fixmap is aligned in a correct way:
    pmd_t *pmd;
    BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
fixmap is a set of fixed virtual address mappings which extend from FIXADDR_START to FIXADDR_TOP. Fixed virtual addresses are needed for subsystems that need to know the virtual address at compile time. After the check, early_ioremap_init calls the early_ioremap_setup function from mm/early_ioremap.c. early_ioremap_setup fills the slot_virt array of unsigned long with the virtual addresses of the 512 temporary boot-time fix-mappings:
    for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
After this we get the page middle directory entry for FIX_BTMAP_BEGIN and put it into the pmd variable, fill bm_pte (the boot-time page tables) with zeros, and call the pmd_populate_kernel function to set the given page table entry in the given page middle directory:
    pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
    memset(bm_pte, 0, sizeof(bm_pte));
    pmd_populate_kernel(&init_mm, pmd, bm_pte);
That's all for this. If you feel confused, don't worry. There is a special part about ioremap and fixmaps in the Linux Kernel Memory Management. Part 2 chapter.
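If the slot arithmetic in early_ioremap_setup is hard to picture, the following user-space sketch may help. It assumes 4 KB pages, 8 slots of 64 pages each (which gives the 512 temporary mappings mentioned above), and that fixmap indexes grow downwards from FIXADDR_TOP one page at a time. The concrete FIXADDR_TOP and FIX_BTMAP_BEGIN values are hypothetical, picked only so that the numbers are easy to follow, and fill_slot_virt is an illustrative name rather than a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT        12
#define NR_FIX_BTMAPS     64u  /* pages in one boot-time slot */
#define FIX_BTMAPS_SLOTS  8u   /* 8 slots * 64 pages = 512 mappings */

/* Hypothetical values, picked only to make the arithmetic visible. */
#define FIXADDR_TOP       0xffffffffff5ff000ULL
#define FIX_BTMAP_BEGIN   1023u

/* Fixmap indexes grow downwards from FIXADDR_TOP, one page per index. */
static uint64_t fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((uint64_t)idx << PAGE_SHIFT);
}

/* Same loop shape as early_ioremap_setup: one start address per slot. */
static void fill_slot_virt(uint64_t slot_virt[FIX_BTMAPS_SLOTS])
{
    /* The highest slot must not underflow the index. */
    assert(FIX_BTMAP_BEGIN >= NR_FIX_BTMAPS * (FIX_BTMAPS_SLOTS - 1));

    for (unsigned int i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS * i);
}
```

With these assumed values every slot starts 64 pages (0x40000 bytes) above the previous one, so slot 0 has the lowest virtual address and slot 7 the highest.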
Obtaining major and minor numbers for the root device
After early ioremap is initialized, you can see the following code:
    ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
This code obtains the major and minor numbers for the root device where initrd will be mounted later in the do_mount_root function. The major number of a device identifies the driver associated with the device. The minor number refers to the device controlled by the driver. Note that old_decode_dev takes one parameter from the boot_params structure. As we can read in the x86 linux kernel boot protocol:
    Field name:   root_dev
    Type:         modify (optional)
    Offset/size:  0x1fc/2
    Protocol:     ALL
      The default root device device number.  The use of this field is
      deprecated, use the "root=" option on the command line instead.
Now let's try to understand what old_decode_dev is. Actually it just calls MKDEV inside, which generates a dev_t from the given major and minor numbers. Its implementation is pretty easy:
    static inline dev_t old_decode_dev(u16 val)
    {
        return MKDEV((val >> 8) & 255, val & 255);
    }
where dev_t is a kernel data type which presents a major/minor number pair. But what's the strange old_ prefix? For historical reasons, there are two ways of managing the major and minor numbers of a device. In the first way, major and minor numbers occupy 2 bytes. You can see it in the previous code: 8 bits for the major number and 8 bits for the minor number. But there is a problem with this way: only 256 major numbers and 256 minor numbers are possible. So the 16-bit integer was replaced with a 32-bit integer, where 12 bits are reserved for the major number and 20 bits for the minor. You can see this in the new_decode_dev implementation:
    static inline dev_t new_decode_dev(u32 dev)
    {
        unsigned major = (dev & 0xfff00) >> 8;
        unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
    }
If dev is 0xffffffff, after the calculation we will get 0xfff (12 bits) for the major and 0xfffff (20 bits) for the minor. So at the end of the execution of old_decode_dev we will get the major and minor numbers for the root device in ROOT_DEV.
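The two layouts are easier to compare with concrete numbers. Below is a user-space sketch of both decoders. It assumes that MKDEV packs the major number above 20 minor bits; kdev_t, mkdev, dev_major and dev_minor are stand-in names introduced only for this example, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

#define MINORBITS 20
#define MINORMASK ((1u << MINORBITS) - 1)

typedef uint32_t kdev_t;  /* stand-in for the kernel's dev_t */

/* Pack a major/minor pair: major above the 20 minor bits. */
static kdev_t mkdev(unsigned int major, unsigned int minor)
{
    assert(major < (1u << 12) && minor <= MINORMASK);
    return (major << MINORBITS) | minor;
}

static unsigned int dev_major(kdev_t dev) { return dev >> MINORBITS; }
static unsigned int dev_minor(kdev_t dev) { return dev & MINORMASK; }

/* 16-bit layout: major in bits 15..8, minor in bits 7..0. */
static kdev_t old_decode_dev(uint16_t val)
{
    return mkdev((val >> 8) & 255, val & 255);
}

/* 32-bit layout: major in bits 19..8, minor split over 7..0 and 31..20. */
static kdev_t new_decode_dev(uint32_t dev)
{
    unsigned int major = (dev & 0xfff00) >> 8;
    unsigned int minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    return mkdev(major, minor);
}
```

For example, old_decode_dev(0x0801) yields major 8 and minor 1, while new_decode_dev(0xffffffff) yields major 0xfff and minor 0xfffff, which are exactly the 12-bit and 20-bit maxima described above.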
Memory map setup
The next point is the setup of the memory map with the call of the setup_memory_map function. But before this we set up different parameters such as information about the screen (current row and column, video page, etc. (you can read about it in Video mode initialization and transition to protected mode)), Extended display identification data, video mode, bootloader_type, etc.:
    screen_info = boot_params.screen_info;
    edid_info = boot_params.edid_info;
    saved_video_mode = boot_params.hdr.vid_mode;
    bootloader_type = boot_params.hdr.type_of_loader;
    if ((bootloader_type >> 4) == 0xe) {
        bootloader_type &= 0xf;
        bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
    }
    bootloader_version  = bootloader_type & 0xf;
    bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
We got all of these parameters during boot time and stored them in the boot_params structure. After this we need to set up the end of the I/O memory. As you know, one of the main purposes of the kernel is resource management, and one of the resources is memory. As we already know, there are two ways to communicate with devices: I/O ports and device memory. All information about registered resources is available through:
/proc/ioports - provides a list of currently registered port regions used for input or output communication with a device;
/proc/iomem - provides the current map of the system's memory for each physical device.
At the moment we are interested in /proc/iomem:
    cat /proc/iomem
    00000000-00000fff : reserved
    00001000-0009d7ff : System RAM
    0009d800-0009ffff : reserved
    000a0000-000bffff : PCI Bus 0000:00
    000c0000-000cffff : Video ROM
    000d0000-000d3fff : PCI Bus 0000:00
    000d4000-000d7fff : PCI Bus 0000:00
    000d8000-000dbfff : PCI Bus 0000:00
    000dc000-000dffff : PCI Bus 0000:00
    000e0000-000fffff : reserved
      000e0000-000e3fff : PCI Bus 0000:00
      000e4000-000e7fff : PCI Bus 0000:00
      000f0000-000fffff : System ROM
As you can see, ranges of addresses are shown in hexadecimal notation together with their owner. The linux kernel provides an API for managing any resources in a general way. Global resources (for example PICs or I/O ports) can be divided into subsets relating to any hardware bus slot. The main structure, resource:
    struct resource {
        resource_size_t start;
        resource_size_t end;
        const char *name;
        unsigned long flags;
        struct resource *parent, *sibling, *child;
    };
presents an abstraction for a tree-like subset of system resources. This structure provides the range of addresses from start to end which a resource covers (resource_size_t is phys_addr_t, or u64 for x86_64), the name of a resource (you see these names in the /proc/iomem output) and the flags of a resource (all resource flags are defined in include/linux/ioport.h). The last are three pointers to the resource structure. These pointers enable a tree-like structure:
    +-------------+      +-------------+
    |             |      |             |
    |    parent   |------|    sibling  |
    |             |      |             |
    +-------------+      +-------------+
           |
           |
    +-------------+
    |             |
    |    child    |
    |             |
    +-------------+
Every subset of resources has a root range resource. For iomem it is iomem_resource, which is defined as:
    struct resource iomem_resource = {
        .name   = "PCI mem",
        .start  = 0,
        .end    = -1,
        .flags  = IORESOURCE_MEM,
    };
    EXPORT_SYMBOL(iomem_resource);
iomem_resource defines the root address range for io memory with the PCI mem name and IORESOURCE_MEM (0x00000200) as flags. As I wrote above, our current point is to set up the end address of iomem. We will do it with:
    iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
Here we shift 1 left by boot_cpu_data.x86_phys_bits. boot_cpu_data is the cpuinfo_x86 structure which we filled during the execution of early_cpu_init. As you can understand from the name of the x86_phys_bits field, it presents the number of bits in the maximum physical address in the system. Note also that iomem_resource is passed to the EXPORT_SYMBOL macro. This macro exports the given symbol (iomem_resource in our case) for dynamic linking, or in other words it makes the symbol accessible to dynamically loaded modules.
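As a quick worked example of this shift, the sketch below computes the end address for a given number of physical address bits. phys_addr_end is a hypothetical helper name, and the bit widths used with it (36 and 46) are just assumed examples, not values read from a real CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Highest addressable physical byte for a CPU with phys_bits address bits. */
static uint64_t phys_addr_end(unsigned int phys_bits)
{
    assert(phys_bits > 0 && phys_bits < 64);  /* keep the shift well defined */
    return (1ULL << phys_bits) - 1;
}
```

With 36 physical address bits the result is 0xfffffffff (64 gigabytes minus one byte), and with 46 bits it is 0x3fffffffffff. Either way it is far smaller than the initial .end = -1, which covers the whole 64-bit range.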
Now that we have set the end address of the root iomem resource address range, the next step, as I wrote above, is the setup of the memory map. It is produced with the call of the setup_memory_map function:
    void __init setup_memory_map(void)
    {
        char *who;

        who = x86_init.resources.memory_setup();
        memcpy(&e820_saved, &e820, sizeof(struct e820map));
        printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
        e820_print_map(who);
    }
First of all, look at the call of x86_init.resources.memory_setup. x86_init is an x86_init_ops structure which presents platform-specific setup functions such as resource initialization, pci initialization, etc. The initialization of x86_init is in arch/x86/kernel/x86_init.c. I will not give the full description here because it is very long, but only the one part which interests us for now:
    struct x86_init_ops x86_init __initdata = {
        .resources = {
            .probe_roms        = probe_roms,
            .reserve_resources = reserve_standard_io_resources,
            .memory_setup      = default_machine_specific_memory_setup,
        },
        ...
        ...
        ...
    }
As we can see here, the memory_setup field is default_machine_specific_memory_setup, where we get the number of the e820 entries which we collected at boot time, sanitize the BIOS e820 map and fill the e820map structure with the memory regions. Once all regions are collected, they are printed with printk. You can find this output if you execute the dmesg command; you should see something like this:
    [    0.000000] e820: BIOS-provided physical RAM map:
    [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
    [    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
    [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
    [    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
    [    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
    [    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
    [    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
    [    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
    [    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
    ...
    ...
    ...
Copying of the BIOS Enhanced Disk Device information
The next two steps are parsing of the setup_data with the parse_setup_data function and copying the BIOS EDD to a safe place. setup_data is a field from the kernel boot header, and as we can read in the x86 boot protocol:
    Field name:   setup_data
    Type:         write (special)
    Offset/size:  0x250/8
    Protocol:     2.09+
      The 64-bit physical pointer to NULL terminated single linked list of
      struct setup_data. This is used to define a more extensible boot
      parameters passing mechanism.
It is used for storing setup information for different data types such as the device tree blob, EFI setup data, etc. In the second step we copy the BIOS EDD information, which we collected in arch/x86/boot/edd.c, from the boot_params structure to the edd structure:
    static inline void __init copy_edd(void)
    {
        memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
               sizeof(edd.mbr_signature));
        memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
        edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
        edd.edd_info_nr = boot_params.eddbuf_entries;
    }
Memory descriptor initialization
The next step is initialization of the memory descriptor of the init process. As you may already know, every process has its own address space. This address space is presented with a special data structure called the memory descriptor. In the linux kernel source code the memory descriptor is presented with the mm_struct structure. mm_struct contains many different fields related to the process address space, such as the start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas, etc. This structure is defined in include/linux/mm_types.h. As every process has its own memory descriptor, the task_struct structure contains it in the mm and active_mm fields. And our first init process has it too. You may remember that we saw part of the initialization of the init task_struct with the INIT_TASK macro in the previous part:
    #define INIT_TASK(tsk)  \
    {
        ...
        ...
        ...
        .mm = NULL,         \
        .active_mm = &init_mm, \
        ...
    }
mm points to the process address space and active_mm points to the active address space if the process has none of its own, as is the case for kernel threads (you can read more about it in the documentation). Now we fill the memory descriptor of the initial process:
    init_mm.start_code = (unsigned long) _text;
    init_mm.end_code = (unsigned long) _etext;
    init_mm.end_data = (unsigned long) _edata;
    init_mm.brk = _brk_end;
with the kernel's text, data and brk. init_mm is the memory descriptor of the initial process and is defined as:
    struct mm_struct init_mm = {
        .mm_rb = RB_ROOT,
        .pgd = swapper_pg_dir,
        .mm_users = ATOMIC_INIT(2),
        .mm_count = ATOMIC_INIT(1),
        .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
        .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
        .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
        INIT_MM_CONTEXT(init_mm)
    };
where mm_rb is a red-black tree of the virtual memory areas, pgd is a pointer to the page global directory, mm_users is the count of address space users, mm_count is the primary usage counter and mmap_sem is the memory area semaphore. After we set up the memory descriptor of the initial process, the next step is initialization of the Intel Memory Protection Extensions with mpx_mm_init. The next step after it is initialization of the code/data/bss resources with:
    code_resource.start = __pa_symbol(_text);
    code_resource.end = __pa_symbol(_etext)-1;
    data_resource.start = __pa_symbol(_etext);
    data_resource.end = __pa_symbol(_edata)-1;
    bss_resource.start = __pa_symbol(__bss_start);
    bss_resource.end = __pa_symbol(__bss_stop)-1;
We already know a little about the resource structure (read above). Here we fill the code/data/bss resources with their physical addresses. You can see them in the /proc/iomem output:
    00100000-be825fff : System RAM
      01000000-015bb392 : Kernel code
      015bb393-01930c3f : Kernel data
      01a11000-01ac3fff : Kernel bss
All of these structures are defined in arch/x86/kernel/setup.c and look like a typical resource initialization:
    static struct resource code_resource = {
        .name   = "Kernel code",
        .start  = 0,
        .end    = 0,
        .flags  = IORESOURCE_BUSY | IORESOURCE_MEM
    };
The last step which we will cover in this part is NX configuration. The NX bit, or no-execute bit, is bit 63 of a page directory entry, which controls the ability to execute code from all physical pages mapped by the table entry. This bit can only be used/set when the no-execute page-protection mechanism is enabled by setting EFER.NXE to 1. In the x86_configure_nx function we check that the CPU supports the NX bit and that it is not disabled. After the check we fill __supported_pte_mask depending on the result:
    void x86_configure_nx(void)
    {
        if (cpu_has_nx && !disable_nx)
            __supported_pte_mask |= _PAGE_NX;
        else
            __supported_pte_mask &= ~_PAGE_NX;
    }
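The effect of this mask can be sketched in a few lines of user-space C. The globals are turned into function arguments here so that the code is self-contained; configure_nx_mask, filter_pte_flags and nx_examples are hypothetical names, and the idea that flags are filtered through the mask before use is a simplified assumption about how the kernel applies __supported_pte_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_NX (1ULL << 63)  /* bit 63 of a page table entry */

/* Same decision as x86_configure_nx, with the globals passed as arguments. */
static uint64_t configure_nx_mask(uint64_t mask, bool cpu_has_nx, bool disable_nx)
{
    if (cpu_has_nx && !disable_nx)
        mask |= PAGE_NX;
    else
        mask &= ~PAGE_NX;
    return mask;
}

/* Flags which are not present in the mask are dropped. */
static uint64_t filter_pte_flags(uint64_t flags, uint64_t supported_mask)
{
    return flags & supported_mask;
}

/* Example values only: 0x3 stands for some unrelated low flag bits. */
void nx_examples(void)
{
    uint64_t with_nx = configure_nx_mask(~0ULL, true, false);
    uint64_t without_nx = configure_nx_mask(~0ULL, true, true);

    assert(filter_pte_flags(PAGE_NX | 0x3, with_nx) == (PAGE_NX | 0x3));
    assert(filter_pte_flags(PAGE_NX | 0x3, without_nx) == 0x3);
}
```

So when NX is unsupported or disabled, a request for a non-executable mapping silently loses the NX bit instead of producing an entry the CPU would reject.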
Conclusion
It is the end of the fifth part about the linux kernel initialization process. In this part we continued to dive into the setup_arch function, which initializes architecture-specific stuff. It was a long part, but we are not finished with it. As I already wrote, setup_arch is a big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like fix-mapped addresses, ioremap, etc. Don't worry if they are unclear to you. There is a special part about these concepts - Linux kernel memory management Part 2. In the next part we will continue with the initialization of architecture-specific stuff and will see parsing of the early kernel parameters, an early dump of the pci devices, Desktop Management Interface scanning and many more.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
mm vs active_mm
e820
Supervisor mode access prevention
Kernel stacks
TSS
IDT
Memory mapped I/O
CFI directives
PDF. dwarf4 specification
Call stack
Previous part
Kernel initialization. Part 6.
Architecture-specific initializations, again...
In the previous part we saw architecture-specific (x86_64 in our case) initialization stuff from arch/x86/kernel/setup.c and finished on the x86_configure_nx function, which sets the _PAGE_NX flag depending on support of the NX bit. As I wrote before, the setup_arch function and start_kernel are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after x86_configure_nx is parse_early_param. This function is defined in init/main.c and, as you can understand from its name, it parses the kernel command line and sets up different services depending on the given parameters (you can find all kernel command line parameters in Documentation/kernel-parameters.txt). You may remember how we set up earlyprintk in the earliest part. At that early stage we looked for kernel parameters and their values with the cmdline_find_option function and the __cmdline_find_option, __cmdline_find_option_bool helpers from arch/x86/boot/cmdline.c. Now we are in the generic kernel part which does not depend on architecture, and here we use another approach. If you are reading linux kernel source code, you may already have noticed calls like this:
    early_param("gbpages", parse_direct_gbpages_on);
The early_param macro takes two parameters:
command line parameter name;
function which will be called if the given parameter is passed.
and is defined as:
    #define early_param(str, fn) \
        __setup_param(str, fn, fn, 1)
in include/linux/init.h. As you can see, the early_param macro just calls the __setup_param macro:
    #define __setup_param(str, unique_id, fn, early) \
        static const char __setup_str_##unique_id[] __initconst \
            __aligned(1) = str; \
        static struct obs_kernel_param __setup_##unique_id \
            __used __section(.init.setup) \
            __attribute__((aligned((sizeof(long))))) \
            = { __setup_str_##unique_id, fn, early }
This macro defines the __setup_str_*_id variable (where * depends on the given function name) and assigns it the given command line parameter name. In the next line we can see the definition of the __setup_* variable, whose type is obs_kernel_param, and its initialization. The obs_kernel_param structure is defined as:
    struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
    };
and contains three fields:
name of the kernel parameter;
function which sets up something depending on the parameter;
field which determines whether the parameter is early (1) or not (0).
Note that the __setup_param macro defines the variable with the __section(.init.setup) attribute. It means that all __setup_* structures will be placed in the .init.setup section; moreover, as we can see in include/asm-generic/vmlinux.lds.h, they will be placed between __setup_start and __setup_end:
    #define INIT_SETUP(initsetup_align) \
        . = ALIGN(initsetup_align); \
        VMLINUX_SYMBOL(__setup_start) = .; \
        *(.init.setup) \
        VMLINUX_SYMBOL(__setup_end) = .;
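The idea of a table of parameter descriptors which is walked for every command line token can be sketched in user space. In the kernel the linker collects the entries between __setup_start and __setup_end; in this sketch a plain array plays that role. Everything here is illustrative: kernel_param_sketch, the two handlers, setup_table and do_early_param_sketch are hypothetical names, and special cases which the real kernel code may handle are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same three fields as obs_kernel_param. */
struct kernel_param_sketch {
    const char *str;
    int (*setup_func)(char *);
    int early;
};

static int gbpages_enabled;
static int quiet_calls;

static int parse_gbpages(char *arg) { (void)arg; gbpages_enabled = 1; return 0; }
static int parse_quiet(char *arg)   { (void)arg; quiet_calls++; return 0; }

/* Stand-in for the entries collected in the .init.setup section. */
static const struct kernel_param_sketch setup_table[] = {
    { "gbpages", parse_gbpages, 1 },  /* early parameter */
    { "quiet",   parse_quiet,   0 },  /* not early: skipped at this stage */
};

/* Called once per "param[=val]" token; runs only handlers marked early. */
static int do_early_param_sketch(char *param, char *val)
{
    int handled = 0;

    for (size_t i = 0; i < sizeof(setup_table) / sizeof(setup_table[0]); i++) {
        const struct kernel_param_sketch *p = &setup_table[i];

        assert(p->setup_func != NULL);
        if (p->early && strcmp(param, p->str) == 0) {
            p->setup_func(val);
            handled++;
        }
    }
    return handled;
}
```

Passing "gbpages" to do_early_param_sketch runs its handler, while "quiet" is ignored because its early field is 0. That is the whole point of the third field of the structure.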
Now that we know how parameters are defined, let's get back to the parse_early_param implementation:
    void __init parse_early_param(void)
    {
        static int done __initdata;
        static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

        if (done)
            return;

        /* All fall through to do_early_param. */
        strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
        parse_early_options(tmp_cmdline);
        done = 1;
    }
The parse_early_param function defines two static variables. The first, done, checks whether parse_early_param was already called, and the second is temporary storage for the kernel command line. After this we copy boot_command_line to the temporary command line which we just defined and call the parse_early_options function from the same source code file, main.c. parse_early_options calls the parse_args function from kernel/params.c, where parse_args parses the given command line and calls the do_early_param function. This function goes from __setup_start to __setup_end and calls the function from the obs_kernel_param if a parameter is early. After this all services which depend on early command line parameters are set up, and the next call after parse_early_param is x86_report_nx. As I wrote at the beginning of this part, we already set the NX bit with x86_configure_nx. The next function, x86_report_nx from arch/x86/mm/setup_nx.c, just prints information about NX. Note that we call x86_report_nx not right after x86_configure_nx, but after the call of parse_early_param. The answer is simple: we call it after parse_early_param because the kernel supports the noexec parameter:
    noexec      [X86]
        On X86-32 available only on PAE configured kernels.
        noexec=on: enable non-executable mappings (default)
        noexec=off: disable non-executable mappings
We can see the result at boot time in the kernel log.
After this we can see the call of the:
    memblock_x86_reserve_range_setup_data();
function. This function is defined in the same arch/x86/kernel/setup.c source code file; it remaps memory for the setup_data and reserves a memory block for the setup_data (you can read more about setup_data in the previous part, and about ioremap and memblock in Linux kernel memory management).
In the next step we can see the following conditional statement:
    if (acpi_mps_check()) {
    #ifdef CONFIG_X86_LOCAL_APIC
        disable_apic = 1;
    #endif
        setup_clear_cpu_cap(X86_FEATURE_APIC);
    }
The acpi_mps_check function from arch/x86/kernel/acpi/boot.c depends on the CONFIG_X86_LOCAL_APIC and CONFIG_X86_MPPARSE configuration options:
    int __init acpi_mps_check(void)
    {
    #if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
        /* mptable code is not built-in */
        if (acpi_disabled || acpi_noirq) {
            printk(KERN_WARNING "MPS support code is not built-in.\n"
                   "Using acpi=off or acpi=noirq or pci=noacpi "
                   "may have problem\n");
            return 1;
        }
    #endif
        return 0;
    }
It checks for the built-in MPS or MultiProcessor Specification table. If CONFIG_X86_LOCAL_APIC is set and CONFIG_X86_MPPARSE is not set, acpi_mps_check prints a warning message if one of the command line options acpi=off, acpi=noirq or pci=noacpi is passed to the kernel. If acpi_mps_check returns 1, we disable the local APIC and clear the X86_FEATURE_APIC bit in the capabilities of the current CPU with the setup_clear_cpu_cap macro (you can read more about CPU masks in CPU masks).
Early PCI dump
In the next step we make a dump of the PCI devices with the following code:
    #ifdef CONFIG_PCI
        if (pci_early_dump_regs)
            early_dump_pci_devices();
    #endif
The pci_early_dump_regs variable is defined in arch/x86/pci/common.c and its value depends on the kernel command line parameter pci=earlydump. We can find the definition of this parameter in drivers/pci/pci.c:
    early_param("pci", pci_setup);
The pci_setup function gets the string after pci= and analyzes it. This function calls pcibios_setup, which is defined as __weak in drivers/pci/pci.c, and every architecture defines the same function which overrides the __weak analog. For example, the x86_64 architecture-dependent version is in arch/x86/pci/common.c:
    char * __init pcibios_setup(char *str) {
        ...
        ...
        ...
        } else if (!strcmp(str, "earlydump")) {
            pci_early_dump_regs = 1;
            return NULL;
        }
        ...
        ...
        ...
    }
So, if the CONFIG_PCI option is set and we passed the pci=earlydump option on the kernel command line, the next function which will be called is early_dump_pci_devices from arch/x86/pci/early.c. This function checks the noearly pci parameter with:
    if (!early_pci_allowed())
        return;
and returns if it was passed. Each PCI domain can host up to 256 buses, each bus hosts up to 32 devices, and each device can have up to 8 functions. So we go in a loop:
    for (bus = 0; bus < 256; bus++) {
        for (slot = 0; slot < 32; slot++) {
            for (func = 0; func < 8; func++) {
                ...
                ...
                ...
            }
        }
    }
and read the pci config with the read_pci_config function.
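The three loop counters map directly onto fields of the address used by the legacy PCI configuration mechanism, where a 32-bit address is written to I/O port 0xCF8 and the data is then read from port 0xCFC. The sketch below only computes that address and does not touch any ports; pci_conf1_address is a hypothetical helper name, and the assumption is that the early PCI code uses this mechanism.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build the value written to port 0xCF8: enable bit 31, then
 * bus (8 bits), slot (5 bits), function (3 bits) and register offset.
 */
static uint32_t pci_conf1_address(unsigned int bus, unsigned int slot,
                                  unsigned int func, unsigned int offset)
{
    assert(bus < 256 && slot < 32 && func < 8 && offset < 256);
    return 0x80000000u | (bus << 16) | (slot << 11) | (func << 8) | offset;
}
```

The field widths explain the loop bounds: 8 bits of bus, 5 bits of slot and 3 bits of function give 256 * 32 * 8 = 65536 possible functions to probe in one domain.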
That's all. We will not go deep into the pci details here, but will see more details in the special Drivers/PCI part.
Finish with memory parsing
After early_dump_pci_devices, there are a couple of functions related to available memory and the e820 map which we collected in the First steps in the kernel setup part:
        /* update the e820_saved too */
        e820_reserve_setup_data();
        finish_e820_parsing();
        ...
        ...
        ...
        e820_add_kernel_range();
        trim_bios_range();
        max_pfn = e820_end_of_ram_pfn();
        early_reserve_e820_mpc_new();
Let's look at them. As you can see, the first function is e820_reserve_setup_data. This function does almost the same as memblock_x86_reserve_range_setup_data, which we saw above, but it also calls e820_update_range, which adds new regions to the e820map with the given type, which is E820_RESERVED_KERN in our case. The next function is finish_e820_parsing, which sanitizes the e820map with the sanitize_e820_map function. Besides these two functions we can see a couple of other functions related to e820 in the listing above. The e820_add_kernel_range function takes the physical address of the kernel start and computes its size:
    u64 start = __pa_symbol(_text);
    u64 size = __pa_symbol(_end) - start;
checks that .text, .data and .bss are marked as E820_RAM in the e820map and prints a warning message if not. The next function, trim_bios_range, updates the first 4096 bytes in the e820map as E820_RESERVED and sanitizes it again with the call of sanitize_e820_map. After this we get the last page frame number with the call of the e820_end_of_ram_pfn function. Every memory page has a unique number, the page frame number, and the e820_end_of_ram_pfn function returns the maximum one with the call of e820_end_pfn:
    unsigned long __init e820_end_of_ram_pfn(void)
    {
        return e820_end_pfn(MAX_ARCH_PFN);
    }
where e820_end_pfn takes the maximum page frame number on the certain architecture (MAX_ARCH_PFN is 0x400000000 for x86_64). In e820_end_pfn we go through all e820 slots and check that the e820 entry has the E820_RAM or E820_PRAM type, because we calculate page frame numbers only for these types. We get the base and end page frame numbers for the current e820 entry and make some checks on them:
    for (i = 0; i < e820.nr_map; i++) {
        struct e820entry *ei = &e820.map[i];
        unsigned long start_pfn;
        unsigned long end_pfn;

        if (ei->type != E820_RAM && ei->type != E820_PRAM)
            continue;

        start_pfn = ei->addr >> PAGE_SHIFT;
        end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

        if (start_pfn >= limit_pfn)
            continue;
        if (end_pfn > limit_pfn) {
            last_pfn = limit_pfn;
            break;
        }
        if (end_pfn > last_pfn)
            last_pfn = end_pfn;
    }

    if (last_pfn > max_arch_pfn)
        last_pfn = max_arch_pfn;

    printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
           last_pfn, max_arch_pfn);
    return last_pfn;
After this we check that the last_pfn which we got in the loop is not greater than the maximum page frame number for the certain architecture (x86_64 in our case), print information about the last page frame number and return it. We can see the last_pfn in the dmesg output:
    ...
    [    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
    ...
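The loop is easy to try outside the kernel. Below is a user-space sketch of the same scan over a small table of regions. The region and region_type types and the end_pfn function are stand-ins invented for this example, only one RAM type is modelled, and any table you feed it (for instance a hypothetical RAM region above 4 gigabytes) is your own assumption rather than data from a real machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct region {
    uint64_t addr;
    uint64_t size;
    enum region_type type;
};

/* Highest page frame number covered by RAM regions, capped at limit_pfn. */
static uint64_t end_pfn(const struct region *map, size_t n, uint64_t limit_pfn)
{
    uint64_t last_pfn = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t start, end;

        if (map[i].type != REGION_RAM)
            continue;                       /* only RAM counts */

        start = map[i].addr >> PAGE_SHIFT;
        end = (map[i].addr + map[i].size) >> PAGE_SHIFT;

        if (start >= limit_pfn)
            continue;                       /* entirely above the limit */
        if (end > limit_pfn) {
            last_pfn = limit_pfn;           /* region crosses the limit */
            break;
        }
        if (end > last_pfn)
            last_pfn = end;
    }
    return last_pfn;
}
```

With a RAM region that ends at physical address 0x41f000000, a scan with the x86_64 limit returns 0x41f000. The same table scanned with a limit of 0x100000 (4 gigabytes in pages) stops at the last RAM region below that boundary.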
After this, as we have calculated the biggest page frame number, we calculate max_low_pfn, which is the biggest page frame number in low memory, i.e. below the first 4 gigabytes. If more than 4 gigabytes of RAM are installed, max_low_pfn will be the result of the e820_end_of_low_ram_pfn function, which does the same as e820_end_of_ram_pfn but with a 4 gigabyte limit; otherwise max_low_pfn will be the same as max_pfn:
    if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
        max_low_pfn = e820_end_of_low_ram_pfn();
    else
        max_low_pfn = max_pfn;

    high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
Next we calculate high_memory (which defines the upper bound on direct map memory) with the __va macro, which returns the virtual address for the given physical address.
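The threshold in this condition is simply 4 gigabytes expressed in pages. A tiny sketch, assuming 4 KB pages, shows the arithmetic; pick_max_low_pfn is a hypothetical helper which takes the result of the low-memory scan as an argument instead of calling into the e820 code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* 4 gigabytes in pages: 1 << (32 - 12) = 0x100000. */
#define FOUR_GB_PFN (1ULL << (32 - PAGE_SHIFT))

/* Choose max_low_pfn the same way the condition above does. */
static uint64_t pick_max_low_pfn(uint64_t max_pfn, uint64_t low_ram_end_pfn)
{
    assert(low_ram_end_pfn <= max_pfn);
    return max_pfn > FOUR_GB_PFN ? low_ram_end_pfn : max_pfn;
}
```

For a machine whose max_pfn is 0x41f000, which is above 0x100000, the low-memory value is used. For a machine with 2 gigabytes of RAM (max_pfn 0x80000) both values are the same.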
DMI scanning
The next step after manipulations with different memory regions and e820 slots is collecting information about the computer. We will get all information with the Desktop Management Interface and the following functions:
    dmi_scan_machine();
    dmi_memdev_walk();
The first, dmi_scan_machine, is defined in drivers/firmware/dmi_scan.c. This function goes through the System Management BIOS structures and extracts information. There are two ways to gain access to the SMBIOS table: getting the pointer to the SMBIOS table from the EFI configuration table, and scanning the physical memory between the 0xF0000 and 0xFFFFF addresses. Let's look at the second approach. The dmi_scan_machine function remaps the 0x10000 bytes of memory starting at 0xF0000 with dmi_early_remap, which just expands to early_ioremap:
    void __init dmi_scan_machine(void)
    {
        char __iomem *p, *q;
        char buf[32];
        ...
        ...
        ...
        p = dmi_early_remap(0xF0000, 0x10000);
        if (p == NULL)
            goto error;
and iterates over all DMI header addresses and searches for the _SM_ string:
        memset(buf, 0, 16);
        for (q = p; q < p + 0x10000; q += 16) {
            memcpy_fromio(buf + 16, q, 16);
            if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
                dmi_available = 1;
                dmi_early_unmap(p, 0x10000);
                goto out;
            }
            memcpy(buf, buf + 16, 16);
        }
The _SM_ string must be between 0x000F0000 and 0x000FFFFF. Here we copy 16 bytes to buf with memcpy_fromio, which is the same as memcpy, and execute dmi_smbios3_present and dmi_present on the buffer. These functions check that the first 4 bytes are the _SM_ string, get the SMBIOS version and get the _DMI_ attributes such as the DMI structure table length, table address, etc. After one of these functions finishes executing, you will see its result in the dmesg output:
    [    0.000000] SMBIOS 2.7 present.
    [    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
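The scan itself is a simple search on 16-byte boundaries followed by a checksum test. Here is a user-space sketch which works on an ordinary byte buffer instead of remapped firmware memory. find_smbios_anchor and checksum_ok are hypothetical names, and the sketch assumes only what is described above plus the common convention that all bytes of the entry point structure must sum to zero modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first 16-byte-aligned "_SM_" anchor, or -1. */
static long find_smbios_anchor(const unsigned char *mem, size_t len)
{
    assert(mem != NULL);
    for (size_t off = 0; off + 4 <= len; off += 16)
        if (memcmp(mem + off, "_SM_", 4) == 0)
            return (long)off;
    return -1;
}

/* A structure is accepted only if all of its bytes sum to 0 modulo 256. */
static int checksum_ok(const unsigned char *buf, size_t len)
{
    unsigned char sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = (unsigned char)(sum + buf[i]);
    return sum == 0;
}
```

An anchor placed at an unaligned offset is not found, which is why the kernel loop can step by 16 bytes instead of checking every address.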
At the end of dmi_scan_machine, we unmap the previously remapped memory:
    dmi_early_unmap(p, 0x10000);
The second function is dmi_memdev_walk. As you can understand, it goes over memory devices. Let's look at it:
    void __init dmi_memdev_walk(void)
    {
        if (!dmi_available)
            return;

        if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
            dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
            if (dmi_memdev)
                dmi_walk_early(save_mem_devices);
        }
    }
It checks that DMI is available (we got this in the previous function, dmi_scan_machine) and collects information about memory devices with dmi_walk_early and dmi_alloc, which is defined as:
    #ifdef CONFIG_DMI
    RESERVE_BRK(dmi_alloc, 65536);
    #endif
RESERVE_BRK is defined in arch/x86/include/asm/setup.h and reserves space of the given size in the brk section.
    init_hypervisor_platform();
    x86_init.resources.probe_roms();
    insert_resource(&iomem_resource, &code_resource);
    insert_resource(&iomem_resource, &data_resource);
    insert_resource(&iomem_resource, &bss_resource);
    early_gart_iommu_check();
SMP config
The next step is parsing of the SMP configuration. We do it with a call to the find_smp_config function, which just calls another function inside:
static inline void find_smp_config(void)
{
    x86_init.mpparse.find_smp_config();
}
x86_init.mpparse.find_smp_config is the default_find_smp_config function from arch/x86/kernel/mpparse.c. In the default_find_smp_config function we scan a couple of memory regions for the SMP config and return as soon as it is found in one of them:
if (smp_scan_config(0x0, 0x400) ||
    smp_scan_config(639 * 0x400, 0x400) ||
    smp_scan_config(0xF0000, 0x10000))
    return;
First of all, the smp_scan_config function defines a couple of variables:
unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;
The first is the virtual address of the memory region that we will scan for the SMP config, the second is a pointer to the mpf_intel structure. Let's try to understand what mpf_intel is. All the information is stored in the multiprocessor configuration data structure. mpf_intel represents this structure and looks like this:
struct mpf_intel {
    char signature[4];
    unsigned int physptr;
    unsigned char length;
    unsigned char specification;
    unsigned char checksum;
    unsigned char feature1;
    unsigned char feature2;
    unsigned char feature3;
    unsigned char feature4;
    unsigned char feature5;
};
As we can read in the documentation, one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and mpf_intel stores the physical address (look at the second field, physptr) of the multiprocessor configuration table. So smp_scan_config goes in a loop through the given memory range and tries to find the MP floating pointer structure there. In the loop it checks that the current position holds the SMP signature, that the length is 1 (the length is counted in 16-byte units), that the checksum is valid, and that mpf->specification is 1 or 4 (these are the only values the specification allows):
while (length > 0) {
    mpf = (struct mpf_intel *)bp;
    if ((*bp == SMP_MAGIC_IDENT) &&
        (mpf->length == 1) &&
        !mpf_checksum((unsigned char *)bp, 16) &&
        ((mpf->specification == 1)
        || (mpf->specification == 4))) {
        mem = virt_to_phys(mpf);
        memblock_reserve(mem, sizeof(*mpf));
        if (mpf->physptr)
            smp_reserve_memory(mpf);
        ...
    }
    /* move on to the next 16-byte boundary */
    bp += 4;
    length -= 16;
}
If the search is successful, it reserves the memory block of the structure with memblock_reserve and reserves the physical address of the multiprocessor configuration table. You can find all the documentation about this in the MultiProcessor Specification. You can read more details in the special part about SMP.
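The checks from the loop above can be tried outside of the kernel. The following is a minimal sketch under a few assumptions: struct mpf_sketch only mirrors the layout of mpf_intel shown above, mpf_sum and mpf_looks_valid are hypothetical names, and the signature is compared as the byte string "_MP_" instead of the kernel's SMP_MAGIC_IDENT constant.

```c
#include <stdint.h>
#include <string.h>

/* Same 16-byte layout as the mpf_intel structure shown above. */
struct mpf_sketch {
    char     signature[4];
    uint32_t physptr;
    uint8_t  length;
    uint8_t  specification;
    uint8_t  checksum;
    uint8_t  feature1;
    uint8_t  feature2;
    uint8_t  feature3;
    uint8_t  feature4;
    uint8_t  feature5;
};

/* Sum of `len` bytes, truncated to 8 bits; 0 means the checksum is good. */
static int mpf_sum(const unsigned char *p, int len)
{
    int sum = 0;

    while (len--)
        sum += *p++;
    return sum & 0xFF;
}

/* Hypothetical helper combining the four conditions checked in the loop. */
static int mpf_looks_valid(const struct mpf_sketch *mpf)
{
    return memcmp(mpf->signature, "_MP_", 4) == 0 &&
           mpf->length == 1 &&
           mpf_sum((const unsigned char *)mpf, 16) == 0 &&
           (mpf->specification == 1 || mpf->specification == 4);
}
```

The checksum byte is chosen by the firmware so that all 16 bytes of the structure add up to zero modulo 256, which is why a zero result means success.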
Additional early memory initialization routines
In the next step of setup_arch we can see the call of the early_alloc_pgt_buf function, which allocates the page table buffer for the early stage. The page table buffer will be placed in the brk area. Let's look at its implementation:
void __init early_alloc_pgt_buf(void)
{
    unsigned long tables = INIT_PGT_BUF_SIZE;
    phys_addr_t base;
    base = __pa(extend_brk(tables, PAGE_SIZE));
    pgt_buf_start = base >> PAGE_SHIFT;
    pgt_buf_end = pgt_buf_start;
    pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}
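Note that the three pgt_buf_* values in this function are page frame numbers, not addresses. As a rough illustration of the arithmetic (struct pgt_buf and pgt_buf_from are hypothetical names, and a 4 kilobyte page is assumed):

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/* All three fields are page frame numbers. */
struct pgt_buf {
    uint64_t start; /* first page of the buffer          */
    uint64_t end;   /* next free page, equal to start    */
    uint64_t top;   /* first page after the whole buffer */
};

static struct pgt_buf pgt_buf_from(uint64_t base_phys, uint64_t bytes)
{
    struct pgt_buf b;

    b.start = base_phys >> SK_PAGE_SHIFT;
    b.end   = b.start;                          /* nothing handed out yet */
    b.top   = b.start + (bytes >> SK_PAGE_SHIFT);
    return b;
}
```

For a buffer of six pages placed at the (made up) physical address 0x2000000, this gives the page frames 0x2000 to 0x2005, with top equal to 0x2006.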
First of all, early_alloc_pgt_buf gets the size of the page table buffer. It will be INIT_PGT_BUF_SIZE, which is (6 * PAGE_SIZE) in the current Linux kernel 4.0. As we have the size of the page table buffer, we call the extend_brk function with two parameters: size and align. As you can understand from its name, this function extends the brk area. As we can see in the Linux kernel linker script, brk is placed in memory right after the BSS:
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
    __brk_base = .;
    . += 64 * 1024;         /* 64k alignment slop space */
    *(.brk_reservation)     /* areas brk users have reserved */
    __brk_limit = .;
}
We can also find it with the readelf utility.
After we get the physical address of the new brk with the __pa macro, we calculate the base address and the end of the page table buffer. In the next step, as we have the page table buffer, we reserve a memory block for the brk area with the reserve_brk function:
static void __init reserve_brk(void)
{
    if (_brk_end > _brk_start)
        memblock_reserve(__pa_symbol(_brk_start),
                         _brk_end - _brk_start);
    _brk_start = 0;
}
Note that at the end of reserve_brk we set _brk_start to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for the brk is to unmap out-of-range memory areas in the kernel mapping with the cleanup_highmap function. Remember that the kernel mapping starts at __START_KERNEL_map and that level2_kernel_pgt maps the kernel _text, data and bss, that is the range from _text to _end. At the start of cleanup_highmap we define these parameters:
unsigned long vaddr = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
Now, as we have defined the start and the end of the kernel mapping, we go in a loop through all kernel page middle directory entries and clear the entries which are not between _text and end:
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
    if (pmd_none(*pmd))
        continue;
    if (vaddr < (unsigned long)_text || vaddr > end)
        set_pmd(pmd, __pmd(0));
}
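The same walk can be modelled with a plain array. This is just a sketch: cleanup_entries is a hypothetical name, an entry value of 0 plays the role of pmd_none, and each entry is assumed to cover 2 megabytes as PMD_SIZE does on x86_64.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PMD_SIZE (1ULL << 21)    /* assumption: one entry maps 2 MiB */

/*
 * Clear every populated entry whose virtual address lies outside
 * [text, end]. Returns the number of entries that were cleared.
 */
static size_t cleanup_entries(uint64_t *pmd, size_t n, uint64_t map_start,
                              uint64_t text, uint64_t end)
{
    size_t cleared = 0;
    uint64_t vaddr = map_start;

    for (size_t i = 0; i < n; i++, vaddr += SK_PMD_SIZE) {
        if (pmd[i] == 0)                /* nothing mapped here */
            continue;
        if (vaddr < text || vaddr > end) {
            pmd[i] = 0;                 /* unmap the out-of-range entry */
            cleared++;
        }
    }
    return cleared;
}
```

With four populated entries and a kernel image which occupies only the second and the third one, the first and the last entries are cleared and the two in the middle stay mapped.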
After this we set the limit for memblock allocation with the memblock_set_current_limit function (you can read more about memblock in the Linux kernel memory management Part 2). It will be ISA_END_ADDRESS, or 0x100000. Then we fill the memblock information according to e820 with a call to the memblock_x86_fill function. You can see the result of this function at kernel initialization time:
MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]    [0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]    [0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]    [0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]  [0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]  [0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]  [0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
The rest of the functions after memblock_x86_fill are:
early_reserve_e820_mpc_new allocates additional slots in the e820 map for the MultiProcessor Specification table;
reserve_real_mode reserves low memory, in the range from 0x0 to 1 megabyte, for the trampoline to real mode (for rebooting and so on);
trim_platform_memory_ranges trims certain memory regions starting at 0x20050000, 0x20110000 and so on; these regions must be excluded because Sandy Bridge has problems with them;
trim_low_memory_range reserves the first 4 kilobyte page in memblock;
init_mem_mapping reconstructs the direct memory mapping and sets up the direct mapping of the physical memory at PAGE_OFFSET;
early_trap_pf_init sets up the #PF handler (we will look at it in the chapter about interrupts);
setup_real_mode sets up the trampoline to the real mode code.
That's all. You can note that this part does not cover all the functions which are in setup_arch (like early_gart_iommu_check, mtrr initialization and so on). As I have already written many times, setup_arch is big, and the Linux kernel is big. That's why I can't cover every line in the Linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I skipped them anyway, because I think that it is not realistic to cover the full Linux kernel. Anyway, we will often return to ideas that we have already seen, and if something is unfamiliar, we will cover that theme.
Conclusion
It is the end of the sixth part about the Linux kernel initialization process. In this part we continued to dive into the setup_arch function again. It was a long part, but we are not finished with it. Yes, setup_arch is big; I hope that the next part will be the last one about this function.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
MultiProcessor Specification
NX bit
Documentation/kernel-parameters.txt
APIC
CPU masks
Linux kernel memory management
PCI
e820
System Management BIOS
EFI
SMP
BSS
SMBIOS specification
Previous part
Kernel initialization. Part 7.
The End of the architecture-specific initializations, almost...
This is the seventh part of the Linux Kernel initialization process, which covers the internals of the setup_arch function from arch/x86/kernel/setup.c. As you know from the previous parts, the setup_arch function does some architecture-specific (in our case it is x86_64) initialization stuff like reserving memory for the kernel code/data/bss, early scanning of the Desktop Management Interface, an early dump of the PCI devices and many many more. If you have read the previous part, you can remember that we finished it at the setup_real_mode function. In the next step, as we have set the limit of memblock to all mapped pages, we can see the call of the setup_log_buf function from kernel/printk/printk.c.
The setup_log_buf function sets up the kernel cyclic buffer, whose length depends on the CONFIG_LOG_BUF_SHIFT configuration option. As we can read in the documentation of CONFIG_LOG_BUF_SHIFT, it can be between 12 and 21. Internally, the buffer is defined as an array of chars:
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
Now let's look at the implementation of the setup_log_buf function. It starts with a check that log_buf still points to the static __log_buf array (it must, because we have only just set it up) and another check whether this is an early setup. If the setup of the kernel log buffer is not early, we call the log_buf_add_cpu function, which increases the size of the buffer for every CPU:
if (log_buf != __log_buf)
    return;
if (!early && !new_log_buf_len)
    log_buf_add_cpu();
We will not look into the log_buf_add_cpu function, because as you can see in setup_arch, we call setup_log_buf as:
setup_log_buf(1);
where 1 means that it is an early setup. In the next step we check the new_log_buf_len variable, which is the updated length of the kernel log buffer, and either allocate new space for the buffer with the memblock_virt_alloc function or just return.
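To get a feeling for the sizes involved: a shift of 12 gives a 4 kilobyte buffer and a shift of 21 gives 2 megabytes. A tiny sketch of this calculation follows; sk_log_buf_len and the two limit macros are hypothetical names which only restate the range mentioned above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the limits of CONFIG_LOG_BUF_SHIFT. */
#define SK_LOG_BUF_SHIFT_MIN 12
#define SK_LOG_BUF_SHIFT_MAX 21

/* Buffer length in bytes for the given shift, or 0 if it is out of range. */
static size_t sk_log_buf_len(unsigned int shift)
{
    if (shift < SK_LOG_BUF_SHIFT_MIN || shift > SK_LOG_BUF_SHIFT_MAX)
        return 0;
    return (size_t)1 << shift;
}
```

The length is always a power of two, because it is computed the same way as __LOG_BUF_LEN, with a left shift of 1.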
As the kernel log buffer is ready, the next function is reserve_initrd. You can remember that we already called the early_reserve_initrd function in the fourth part of the Kernel initialization. Now, as we have reconstructed the direct memory mapping in the init_mem_mapping function, we need to move the initrd down into directly mapped memory. The reserve_initrd function starts with the definition of the base address and the end address of the initrd and checks that an initrd was provided by a bootloader, all the same as we saw in early_reserve_initrd. But instead of reserving a place in the memblock area with a call to the memblock_reserve function, we get the mapped size of the direct memory area and check that the size of the initrd is less than half of this area:
mapped_size = memblock_mem_size(max_pfn_mapped);
if (ramdisk_size >= (mapped_size >> 1))
    panic("initrd too large to handle, "
          "disabling initrd (%lld needed, %lld available)\n",
          ramdisk_size, mapped_size >> 1);
You can see here that we call the memblock_mem_size function and pass max_pfn_mapped to it, where max_pfn_mapped contains the highest directly mapped page frame number. If you do not remember what a page frame number is, the explanation is simple: the lowest 12 bits of an address represent the offset inside the physical page, or page frame. If we shift the address right by 12 bits, we discard the offset part and get the Page Frame Number. In memblock_mem_size we go through all memblock memory (not reserved) regions, calculate the size of the mapped pages and return it to the mapped_size variable (see the code above). As we have the amount of the directly mapped memory, we check that the size of the initrd is less than half of the mapped memory. If it is not, we just call panic, which halts the system and prints the famous Kernel panic message. In the next step we print information about the initrd. We can see the result of this in the dmesg output:
[    0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]
After this we relocate the initrd to the direct mapping area with the relocate_initrd function. At the start of the relocate_initrd function we try to find a free area with the memblock_find_in_range function:
relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);
if (!relocated_ramdisk)
    panic("Cannot find place for new RAMDISK of size %lld\n",
          ramdisk_size);
The memblock_find_in_range function tries to find a free area in the given range, in our case from 0 to the maximum mapped physical address, and the size must be equal to the aligned size of the initrd. If we did not find an area of the given size, we call panic again. If all is good, we start to relocate the RAM disk down into the directly mapped memory in the next step.
At the end of the reserve_initrd function, we free the memblock memory which was occupied by the ramdisk with a call to:
memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
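The page frame arithmetic and the size check from reserve_initrd are easy to try in user space. The helpers below are hypothetical (pfn_of, page_offset_of and initrd_too_large are not kernel names) and assume 4 kilobyte pages:

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/* Page frame number: the address without its low 12 offset bits. */
static uint64_t pfn_of(uint64_t addr)
{
    return addr >> SK_PAGE_SHIFT;
}

/* Offset inside the page: only the low 12 bits of the address. */
static uint64_t page_offset_of(uint64_t addr)
{
    return addr & ((1ULL << SK_PAGE_SHIFT) - 1);
}

/* Same condition as in the panic check: half of the mapped memory or more. */
static int initrd_too_large(uint64_t ramdisk_size, uint64_t mapped_size)
{
    return ramdisk_size >= (mapped_size >> 1);
}
```

For example, the address 0x36d20123 belongs to page frame 0x36d20 and has the offset 0x123 inside that page.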
After we have relocated the initrd ramdisk image, the next function is vsmp_init from arch/x86/kernel/vsmp_64.c. This function initializes support of the ScaleMP vSMP. As I already wrote in the previous parts, this chapter will not cover initialization parts which are not related to generic x86_64 (for example this one, or ACPI and so on). So we will skip the implementation of this for now and will come back to it in the part which covers techniques of parallel computing.
The next function is io_delay_init from arch/x86/kernel/io_delay.c. This function allows overriding the default I/O delay port 0x80. We already saw the I/O delay in the Last preparation before transition into protected mode, now let's look at the io_delay_init implementation:
void __init io_delay_init(void)
{
    if (!io_delay_override)
        dmi_check_system(io_delay_0xed_port_dmi_table);
}
This function checks the io_delay_override variable. If the I/O delay method was not overridden on the kernel command line, it looks up the current machine in a DMI table of systems which need the alternate 0xed port. We can set the io_delay_override variable by passing the io_delay option on the kernel command line. As we can read in Documentation/kernel-parameters.txt, the io_delay option is:
io_delay=   [X86] I/O delay method
    0x80
        Standard port 0x80 based delay
    0xed
        Alternate port 0xed based delay (needed on some systems)
    udelay
        Simple two microseconds delay
    none
        No delay
We can see that the io_delay command line parameter is set up with the early_param macro in arch/x86/kernel/io_delay.c:
early_param("io_delay", io_delay_param);
You can read more about early_param in the previous part. So the io_delay_param function, which sets the io_delay_override variable, will be called in the do_early_param function. The io_delay_param function gets the argument of the io_delay kernel command line parameter and sets io_delay_type depending on it:
static int __init io_delay_param(char *s)
{
    if (!s)
        return -EINVAL;
    if (!strcmp(s, "0x80"))
        io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
    else if (!strcmp(s, "0xed"))
        io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
    else if (!strcmp(s, "udelay"))
        io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
    else if (!strcmp(s, "none"))
        io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
    else
        return -EINVAL;
    io_delay_override = 1;
    return 0;
}
The next functions after io_delay_init are acpi_boot_table_init, early_acpi_boot_init and initmem_init, but as I wrote above, we will not cover ACPI related stuff in this Linux Kernel initialization process chapter.
Allocate area for DMA
In the next step we need to allocate an area for Direct memory access with the dma_contiguous_reserve function, which is defined in drivers/base/dma-contiguous.c. DMA is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter, max_pfn_mapped << PAGE_SHIFT, to the dma_contiguous_reserve function, and as you can understand from this expression, this is the limit of the reserved memory. Let's look at the implementation of this function. It starts with the definition of the following variables:
phys_addr_t selected_size = 0;
phys_addr_t selected_base = 0;
phys_addr_t selected_limit = limit;
bool fixed = false;
where the first represents the size in bytes of the reserved area, the second is the base address of the reserved area, the third is the end address of the reserved area, and the last, fixed, shows where to place the reserved area. If fixed is 1 we just reserve the area at the given base with memblock_reserve; if it is 0, the area is allocated somewhere below the given limit. In the next step we check the size_cmdline variable, and if it is not equal to -1 we fill all the variables which you can see above with the values from the cma kernel command line parameter:
if (size_cmdline != -1) {
    ...
    ...
    ...
}
In this source code file you can find the definition of the early parameter:
early_param("cma", early_cma);
where cma is:
cma=nn[MG]@[start[MG][-end[MG]]]
        [ARM,X86,KNL]
        Sets the size of kernel global memory area for
        contiguous memory allocations and optionally the
        placement constraint by the physical address range of
        memory allocations. A value of 0 disables CMA
        altogether. For more information, see
        include/linux/dma-contiguous.h
If we do not pass the cma option on the kernel command line, size_cmdline will be equal to -1. In this case we need to calculate the size of the reserved area, which depends on the following kernel configuration options:
CONFIG_CMA_SIZE_SEL_MBYTES - size in megabytes, the default global CMA area, which is equal to CMA_SIZE_MBYTES * SZ_1M or CONFIG_CMA_SIZE_MBYTES * 1M;
CONFIG_CMA_SIZE_SEL_PERCENTAGE - percentage of total memory;
CONFIG_CMA_SIZE_SEL_MIN - use the lower value;
CONFIG_CMA_SIZE_SEL_MAX - use the higher value.
As we have calculated the size of the reserved area, we reserve the area with a call to the dma_contiguous_reserve_area function, which first of all calls the function:
ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
cma_declare_contiguous reserves a contiguous area from the given base address and with the given size. After we have reserved the area for DMA, the next function is memblock_find_dma_reserve. As you can understand from its name, this function counts the reserved pages in the DMA area. This part will not cover all the details of CMA and DMA, because they are big. We will see much more detail in the special part of the Linux Kernel Memory management which covers contiguous memory allocators and areas.
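The choice between the four CONFIG_CMA_SIZE_SEL_* options can be sketched as follows. In the kernel this choice is made at compile time with the preprocessor; here, as an assumption made for the sake of a small runnable example, it is a runtime enum. The names sk_cma_sel and sk_cma_default_size are hypothetical, and the percentage is taken from a byte count instead of a page count.

```c
#include <stdint.h>

/* Hypothetical runtime stand-in for the four CONFIG_CMA_SIZE_SEL_* options. */
enum sk_cma_sel {
    SK_SEL_MBYTES,
    SK_SEL_PERCENTAGE,
    SK_SEL_MIN,
    SK_SEL_MAX
};

/*
 * size_bytes  - the configured size in megabytes, converted to bytes
 * total_bytes - the total amount of memory
 * percent     - the configured percentage of total memory
 */
static uint64_t sk_cma_default_size(enum sk_cma_sel sel, uint64_t size_bytes,
                                    uint64_t total_bytes, unsigned int percent)
{
    uint64_t by_percent = total_bytes / 100 * percent;

    switch (sel) {
    case SK_SEL_MBYTES:
        return size_bytes;
    case SK_SEL_PERCENTAGE:
        return by_percent;
    case SK_SEL_MIN:        /* use the lower value */
        return size_bytes < by_percent ? size_bytes : by_percent;
    case SK_SEL_MAX:        /* use the higher value */
        return size_bytes > by_percent ? size_bytes : by_percent;
    }
    return 0;
}
```

With 16 megabytes configured, 10 percent and 1000 megabytes of memory, the MIN option selects 16 megabytes and the MAX option selects 100 megabytes.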
Initialization of the sparse memory
The next step is the call of the function x86_init.paging.pagetable_init. If you try to find this function in the Linux kernel source code, at the end of your search you will see the following macro:
#define native_pagetable_init paging_init
which expands, as you can see, to a call of the paging_init function from arch/x86/mm/init_64.c. The paging_init function initializes sparse memory and zone sizes. First of all, what are zones and what is Sparsemem? Sparsemem is a special foundation in the Linux kernel memory manager which is used to split the memory area into different memory banks in NUMA systems. Let's look at the implementation of the paging_init function:
void __init paging_init(void)
{
    sparse_memory_present_with_active_regions(MAX_NUMNODES);
    sparse_init();
    node_clear_state(0, N_MEMORY);
    if (N_MEMORY != N_NORMAL_MEMORY)
        node_clear_state(0, N_NORMAL_MEMORY);
    zone_sizes_init();
}
As you can see, there is a call of the sparse_memory_present_with_active_regions function, which records a memory area for every NUMA node in the array of mem_section structures, each of which contains a pointer to an array of struct page. The next function, sparse_init, allocates the non-linear mem_section and mem_map. In the next step we clear the state of the movable memory nodes and initialize the sizes of zones. Every NUMA node is divided into a number of pieces which are called zones. So the zone_sizes_init function from arch/x86/mm/init.c initializes the sizes of zones.
Again, this part and the next parts do not cover this theme in full detail. There will be a special part about NUMA.
The next step after the SparseMem initialization is setting trampoline_cr4_features, which must contain the content of the cr4 Control register. First of all we need to check that the current CPU has support for the cr4 register, and if it has, we save its content to trampoline_cr4_features, which is the storage for cr4 in real mode:
if (boot_cpu_data.cpuid_level >= 0) {
    mmu_cr4_features = __read_cr4();
    if (trampoline_cr4_features)
        *trampoline_cr4_features = mmu_cr4_features;
}
vsyscall mapping
The next function which you can see is map_vsyscall from arch/x86/kernel/vsyscall_64.c. This function maps the memory space for vsyscalls and depends on the CONFIG_X86_VSYSCALL_EMULATION kernel configuration option. Actually vsyscall is a special segment which provides fast access to certain system calls like getcpu and so on. Let's look at the implementation of this function:
void __init map_vsyscall(void)
{
    extern char __vsyscall_page;
    unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
    if (vsyscall_mode != NONE)
        __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                     vsyscall_mode == NATIVE
                     ? PAGE_KERNEL_VSYSCALL
                     : PAGE_KERNEL_VVAR);
    BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                 (unsigned long)VSYSCALL_ADDR);
}
At the beginning of map_vsyscall we can see the definition of two variables. The first is the extern variable __vsyscall_page. As the variable is extern, it is defined somewhere in another source code file. Actually we can see the definition of __vsyscall_page in arch/x86/kernel/vsyscall_emu_64.S. The __vsyscall_page symbol points to the aligned calls of the vsyscalls such as gettimeofday and so on:
    .globl __vsyscall_page
    .balign PAGE_SIZE, 0xcc
    .type __vsyscall_page, @object
__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret
    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret
    ...
    ...
    ...
The second variable is physaddr_vsyscall, which just stores the physical address of the __vsyscall_page symbol. In the next step we check that the vsyscall_mode variable, which is EMULATE by default, is not equal to NONE:
static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
And after this check we can see the call of the __set_fixmap function, which calls native_set_fixmap with the same parameters:
void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
    __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
}
void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
{
    unsigned long address = __fix_to_virt(idx);
    if (idx >= __end_of_fixed_addresses) {
        BUG();
        return;
    }
    set_pte_vaddr(address, pte);
    fixmaps_set++;
}
Here we can see that native_set_fixmap makes the value of a Page Table Entry from the given physical address (the physical address of the __vsyscall_page symbol in our case) and calls the internal function __native_set_fixmap. The internal function gets the virtual address of the given fixed_addresses index (VSYSCALL_PAGE in our case) and checks that the given index is not beyond the end of the fix-mapped addresses. After this we set the page table entry with a call to the set_pte_vaddr function and increase the count of the fix-mapped addresses. And at the end of map_vsyscall we check with the BUILD_BUG_ON macro that the virtual address of VSYSCALL_PAGE (which is the first index in fixed_addresses) is equal to VSYSCALL_ADDR, which is -10UL << 20 or ffffffffff600000:
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
             (unsigned long)VSYSCALL_ADDR);
Now the vsyscall area is in the fix-mapped area. That's all about map_vsyscall. If you do not know anything about fix-mapped addresses, you can read Fix-Mapped Addresses and ioremap. We will see more about vsyscalls in the vsyscalls and vdso part.
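If the expression -10UL << 20 looks strange, it can be checked with a few lines of C. The sketch below uses uint64_t instead of unsigned long so that it gives the same result on platforms where long is 32 bits wide; sk_vsyscall_addr is a hypothetical name.

```c
#include <stdint.h>

/*
 * -10 as an unsigned 64-bit value is 0xfffffffffffffff6. Shifting it
 * left by 20 bits gives an address 10 megabytes below the top of the
 * 64-bit address space.
 */
static uint64_t sk_vsyscall_addr(void)
{
    return -(uint64_t)10 << 20;
}
```

The result is 0xffffffffff600000, and subtracting it from zero (with wrap-around) gives exactly 10 << 20 bytes, that is 10 megabytes.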
Getting the SMP configuration
You can remember how we searched for the SMP configuration in the previous part. Now we need to get the SMP configuration if we found it. For this we check the smp_found_config variable, which we set in the smp_scan_config function (read about it in the previous part), and call the get_smp_config function:
if (smp_found_config)
    get_smp_config();
get_smp_config expands to a call of x86_init.mpparse.get_smp_config, which is the default_get_smp_config function defined in arch/x86/kernel/mpparse.c. This function defines a pointer to the multiprocessor floating pointer structure, mpf_intel (you can read about it in the previous part), and does some checks:
struct mpf_intel *mpf = mpf_found;
if (!mpf)
    return;
if (acpi_lapic && early)
    return;
Here we check that the multiprocessor configuration was found in the smp_scan_config function, and just return from the function if not. The next check returns if the local APIC was already found through ACPI and this is an early call. As we have done these checks, we start to read the SMP configuration. As we finish reading it, the next step is the prefill_possible_map function, which makes a preliminary filling of the possible CPUs cpumask (you can read more about it in the Introduction to the cpumasks).
The rest of the setup_arch
Here we are getting to the end of the setup_arch function. The rest of the functions of course do important stuff, but details about them will not be included in this part. We will just take a short look at these functions, because although they are important, as I wrote above, they cover non-generic kernel features related to NUMA, SMP, ACPI, APICs and so on. First of all, the next call is the init_apic_mappings function. As we can understand, this function sets the address of the local APIC. The next is x86_io_apic_ops.init, and this function initializes the I/O APIC. Please note that we will see all the details related to APIC in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like DMA, TIMER, FPU and so on with a call to the x86_init.resources.reserve_resources function. Following is the mcheck_init function, which initializes the Machine check Exception, and the last is register_refined_jiffies, which registers jiffy (there will be a separate chapter about timers in the kernel).
So that's all. Finally we have finished with the big setup_arch function in this part. Of course, as I have already written many times, we did not see the full details of this function, but do not worry about it. We will be back to this function more than once from different chapters to understand how different platform-dependent parts are initialized.
That's all, and now we can go back to start_kernel from setup_arch.
Back to the main.c
As I wrote above, we have finished with the setup_arch function and now we can go back to the start_kernel function from init/main.c. As you may remember or have even seen yourself, the start_kernel function is as big as setup_arch. So the next couple of parts will be dedicated to learning this function. So, let's continue with it. After setup_arch we can see the call of the mm_init_cpumask function. This function sets the cpumask pointer in the memory descriptor. We can look at its implementation:
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
    mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
    cpumask_clear(mm->cpu_vm_mask_var);
}
As you can see in init/main.c, we pass the memory descriptor of the init process to mm_init_cpumask. Here, if the CONFIG_CPUMASK_OFFSTACK configuration option is set, we first point cpu_vm_mask_var to the cpumask_allocation field, and in any case we clear the TLB switch cpumask.
In the next step we can see the call of the following function:
setup_command_line(command_line);
This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers because one buffer is used for future reference and access to the command line and one for parameter parsing. We will allocate space for the following buffers:
saved_command_line - will contain the boot command line;
initcall_command_line - will contain the boot command line and will be used in do_initcall_level;
static_command_line - will contain the command line for parameter parsing.
We will allocate space with the memblock_virt_alloc function. This function calls memblock_virt_alloc_try_nid, which allocates a boot memory block with memblock_reserve if slab is not available, or uses kzalloc_node otherwise (more about it will be in the Linux memory management chapter). memblock_virt_alloc uses BOOTMEM_LOW_LIMIT (the physical address of the (PAGE_OFFSET + 0x1000000) value) and BOOTMEM_ALLOC_ACCESSIBLE (equal to the current value of memblock.current_limit) as the minimum address and the maximum address of the memory region.
Let's look at the implementation of setup_command_line:
static void __init setup_command_line(char *command_line)
{
    saved_command_line =
        memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
    initcall_command_line =
        memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
    static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
    strcpy(saved_command_line, boot_command_line);
    strcpy(static_command_line, command_line);
}
Here we can see that we allocate space for the three buffers which will contain the kernel command line for the different purposes (read above). And as we have allocated the space, we store boot_command_line in saved_command_line, and command_line (the kernel command line from setup_arch) in static_command_line.
The next function after setup_command_line is setup_nr_cpu_ids. This function sets nr_cpu_ids (the number of CPUs) according to the last bit in cpu_possible_mask (you can read more about it in the chapter which describes the cpumasks concept). Let's look at its implementation:
void __init setup_nr_cpu_ids(void)
{
    nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1;
}
Here nr_cpu_ids represents the number of CPUs, and NR_CPUS represents the maximum number of CPUs, which we can set at configuration time.
Actually we need to call this function because NR_CPUS can be greater than the actual number of CPUs in your computer. Here we can see that we call the find_last_bit function and pass two parameters to it:
the cpu_possible_mask bits;
the maximum number of CPUs.
In setup_arch we can find the call of the prefill_possible_map function, which calculates the actual number of CPUs and writes it to cpu_possible_mask. We call the find_last_bit function, which takes an address and the maximum size to search and returns the bit number of the last set bit. We passed the cpu_possible_mask bits and the maximum number of CPUs. First of all, the find_last_bit function splits the given size into words of unsigned long:
words = size / BITS_PER_LONG;
where BITS_PER_LONG is 64 on x86_64. As we have the number of words in the given size of the search data, we need to check whether the given size contains a partial word with the following check:
if (size & (BITS_PER_LONG - 1)) {
    tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
                          - (size & (BITS_PER_LONG - 1)))));
    if (tmp)
        goto found;
}
If it contains a partial word, we mask the last word and check it. If the last word is not zero, it means that the current word contains at least one set bit. We go to the found label:
found:
    return words * BITS_PER_LONG + __fls(tmp);
Here you can see the __fls function, which returns the last set bit in a given word with the help of the bsr instruction:
static inline unsigned long __fls(unsigned long word)
{
    asm("bsr %1,%0"
        : "=r" (word)
        : "rm" (word));
    return word;
}
The bsr instruction scans the given operand for the most significant set bit. If there is no partial word, or the partial word is empty, we go through all the words at the given address from the last one to the first and try to find the last set bit:
while (words) {
    tmp = addr[--words];
    if (tmp) {
found:
        return words * BITS_PER_LONG + __fls(tmp);
    }
}
Here we put the last word into the tmp variable and check that tmp contains at least one set bit. If a set bit is found, we return the number of this bit. If none of the words contains a set bit, we just return the given size:
return size;
After this nr_cpu_ids will contain the correct number of available CPUs.
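The whole algorithm fits into a small portable function. The following is a sketch and not the kernel implementation: sk_find_last_bit and sk_fls are hypothetical names, and the bsr instruction is replaced with a plain loop so that the code does not depend on x86.

```c
#include <limits.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the most significant set bit; `word` must not be zero. */
static unsigned long sk_fls(unsigned long word)
{
    unsigned long pos = 0;

    while (word >>= 1)
        pos++;
    return pos;
}

/* Returns the number of the last set bit below `size`, or `size` if none. */
static unsigned long sk_find_last_bit(const unsigned long *addr,
                                      unsigned long size)
{
    unsigned long words = size / SK_BITS_PER_LONG;
    unsigned long rem = size % SK_BITS_PER_LONG;
    unsigned long tmp;

    if (rem) {
        /* partial last word: keep only its lowest `rem` bits */
        tmp = addr[words] & (~0UL >> (SK_BITS_PER_LONG - rem));
        if (tmp)
            return words * SK_BITS_PER_LONG + sk_fls(tmp);
    }
    while (words) {
        tmp = addr[--words];
        if (tmp)
            return words * SK_BITS_PER_LONG + sk_fls(tmp);
    }
    return size;    /* no set bit was found */
}
```

For a mask with the four lowest bits set, which would describe four possible CPUs, sk_find_last_bit returns 3, and adding 1 as setup_nr_cpu_ids does gives 4.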
That's all.
Conclusion
It is the end of the seventh part about the Linux kernel initialization process. In this part we have finally finished with the setup_arch function and returned to the start_kernel function. In the next part we will continue to learn generic kernel code from start_kernel and will continue our way to the first init process.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
Desktop Management Interface
x86_64
initrd
Kernel panic
Documentation/kernel-parameters.txt
ACPI
Direct memory access
NUMA
Control register
vsyscalls
SMP
jiffy
Previous part
Kernel initialization. Part 8.
Scheduler initialization
This is the eighth part of the Linux kernel initialization process, and we stopped at the setup_nr_cpu_ids function in the previous part. The main point of the current part is scheduler initialization. But before we start to learn the initialization process of the scheduler, we need to do some stuff. The next step in init/main.c is the setup_per_cpu_areas function. This function sets up areas for the percpu variables; you can read more about it in the special part about the Per-CPU variables. After the percpu areas are up and running, the next step is the smp_prepare_boot_cpu function. This function does some preparations for SMP:
static inline void smp_prepare_boot_cpu(void)
{
    smp_ops.smp_prepare_boot_cpu();
}
where smp_prepare_boot_cpu expands to a call of the native_smp_prepare_boot_cpu function (more about smp_ops will be in the special parts about SMP):
void __init native_smp_prepare_boot_cpu(void)
{
    int me = smp_processor_id();
    switch_to_new_gdt(me);
    cpumask_set_cpu(me, cpu_callout_mask);
    per_cpu(cpu_state, me) = CPU_ONLINE;
}
The native_smp_prepare_boot_cpu function gets the number of the current CPU (which is the Bootstrap processor, and its id is zero) with the smp_processor_id function. I will not explain how smp_processor_id works, because we already saw it in the Kernel entry point part. As we have the processor id number, we reload the Global Descriptor Table for the given CPU with the switch_to_new_gdt function:
void switch_to_new_gdt(int cpu)
{
    struct desc_ptr gdt_descr;
    gdt_descr.address = (long)get_cpu_gdt_table(cpu);
    gdt_descr.size = GDT_SIZE - 1;
    load_gdt(&gdt_descr);
    load_percpu_segment(cpu);
}
The gdt_descr variable represents a pointer to the GDT descriptor here (we already saw desc_ptr in the Early interrupt and exception handling). We fill in the address and the size of the GDT, where GDT_SIZE is the number of GDT entries multiplied by the 8-byte size of a descriptor:
#define GDT_SIZE (GDT_ENTRIES * 8)
and the address of the descriptor we get with get_cpu_gdt_table:
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
    return per_cpu(gdt_page, cpu).gdt;
}
The get_cpu_gdt_table uses the per_cpu macro to get the gdt_page percpu variable for the given CPU number (the bootstrap processor with id 0 in our case). You may ask the following question: if we can access the gdt_page percpu variable, where was it defined? Actually we already saw it in this book. If you have read the first part of this chapter, you can remember that we saw gdt_page used in the definition of early_gdt_descr in arch/x86/kernel/head_64.S:
early_gdt_descr:
    .word GDT_ENTRIES*8-1
early_gdt_descr_base:
    .quad INIT_PER_CPU_VAR(gdt_page)
and if we look at the linker script we can see that it is located after the __per_cpu_load symbol:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
and gdt_page is filled in arch/x86/kernel/cpu/common.c:
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
    [GDT_ENTRY_KERNEL32_CS]       = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
    [GDT_ENTRY_KERNEL_CS]         = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
    [GDT_ENTRY_KERNEL_DS]         = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER_DS]   = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
    [GDT_ENTRY_DEFAULT_USER_CS]   = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
    ...
    ...
    ...
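The flag values in the table above are easier to read once they are packed into the two 32-bit halves of a segment descriptor. The following is a sketch under assumptions: the bit layout follows the x86 segment descriptor format, while gdt_desc, gdt_entry_init, desc_long_mode and desc_dpl are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The low (a) and high (b) 32-bit halves of an 8-byte segment descriptor. */
struct gdt_desc {
    uint32_t a;
    uint32_t b;
};

/* Pack flags, base and limit the way the x86 descriptor layout expects. */
static struct gdt_desc gdt_entry_init(uint32_t flags, uint32_t base, uint32_t limit)
{
    struct gdt_desc d;

    assert(limit <= 0xfffffu);  /* the limit field is 20 bits wide */
    d.a = (limit & 0xffffu) | ((base & 0xffffu) << 16);
    d.b = ((base & 0xff0000u) >> 16) | ((flags & 0xf0ffu) << 8) |
          (limit & 0xf0000u) | (base & 0xff000000u);
    return d;
}

/* Bit 21 of the high half is the L (64-bit code segment) flag. */
static int desc_long_mode(struct gdt_desc d)
{
    return (int)((d.b >> 21) & 1u);
}

/* Bits 13-14 of the high half hold the descriptor privilege level. */
static int desc_dpl(struct gdt_desc d)
{
    return (int)((d.b >> 13) & 3u);
}
```

With this layout, the flags 0xa09b describe a 64-bit code segment with privilege level 0, 0xa0fb the same kind of segment with privilege level 3, and 0xc09b a 32-bit code segment.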
You can read more about percpu variables in the Per-CPU variables part. As we got the address and the size of the GDT, we can reload the GDT with load_gdt, which just executes the lgdt instruction, and load the percpu segment with the following function:
void load_percpu_segment(int cpu) {
    loadsegment(gs, 0);
    wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
    load_stack_canary_segment();
}
The gs register (or the fs register on x86_32) must hold the base address of the percpu area, so we use the loadsegment macro and pass gs to it. In the next step we write the base address of the IRQ stack to the MSR_GS_BASE register and set up the stack canary (this is only for x86_32). After we load the new GDT, we set the bit of the current cpu in the cpu_callout_mask bitmap and mark the cpu as online by setting the cpu_state percpu variable of the current processor to CPU_ONLINE:
    cpumask_set_cpu(me, cpu_callout_mask);
    per_cpu(cpu_state, me) = CPU_ONLINE;
So, what is the cpu_callout_mask bitmap? As we initialized the bootstrap processor (the processor which is booted first on x86), the other processors in a multiprocessor system are known as secondary processors. The Linux kernel uses the following two bitmasks:
cpu_callout_mask
cpu_callin_mask
After the bootstrap processor is initialized, it updates cpu_callout_mask to indicate which secondary processor can be initialized next. A secondary processor can do some initialization stuff first and then check cpu_callout_mask. Only after the bootstrap processor has set the bit of this secondary processor in cpu_callout_mask does the secondary processor continue with the rest of its initialization. When the secondary processor finishes its initialization, it sets its bit in cpu_callin_mask. Once the bootstrap processor finds the bit of the current secondary processor in cpu_callin_mask, it repeats the same procedure for the rest of the secondary processors. In short it works as I described, but we will see more details in the chapter about SMP.
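The handshake described above can be sketched with two plain bitmasks. This is a single-threaded illustration only: the real kernel uses cpumask operations on separate processors, and the names bsp_allow, ap_try_callin and bsp_saw_callin are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 64u

static unsigned long long callout_mask;  /* bits set by the bootstrap processor */
static unsigned long long callin_mask;   /* bits set by secondary processors */

/* Bootstrap processor: allow secondary processor `cpu` to continue. */
static void bsp_allow(unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    callout_mask |= 1ULL << cpu;
}

/* Secondary processor: report in, but only if the bootstrap processor allowed it. */
static bool ap_try_callin(unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    if (!(callout_mask & (1ULL << cpu)))
        return false;            /* not released yet, keep waiting */
    callin_mask |= 1ULL << cpu;  /* initialization finished */
    return true;
}

/* Bootstrap processor: has secondary processor `cpu` reported in? */
static bool bsp_saw_callin(unsigned int cpu)
{
    assert(cpu < MAX_CPUS);
    return (callin_mask & (1ULL << cpu)) != 0;
}
```

In this sketch a call to ap_try_callin(1) fails until bsp_allow(1) has run, which mirrors the order of events in the text.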
That's all. We did all SMP boot preparation.
Build zonelists
In the next step we can see the call of the build_all_zonelists function. This function sets up the order of zones that allocations are preferred from. What zones are and what this order is, we will understand now. For a start, let's see how the Linux kernel considers physical memory. Physical memory may be arranged into banks which are called nodes. If you have no hardware with NUMA support, you will see only one node:
$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0
Every node is represented by the struct pglist_data in the Linux kernel. Each node is divided into a number of special blocks which are called zones. Every zone is represented by the zone struct in the Linux kernel and has one of the following types:
ZONE_DMA - 0-16M;
ZONE_DMA32 - used for 32-bit devices that can only do DMA to areas below 4G;
ZONE_NORMAL - all RAM from 4GB on x86_64;
ZONE_HIGHMEM - absent on x86_64;
ZONE_MOVABLE - zone which contains movable pages.
which are represented by the zone_type enum. We can get information about zones with:
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
        ...
        ...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
        ...
        ...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
        ...
        ...
As I wrote above, all nodes are described with the pglist_data or pg_data_t structure in memory. This structure is defined in include/linux/mmzone.h. The build_all_zonelists function from mm/page_alloc.c constructs an ordered zonelist (of the different zones DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE) which specifies the zones/nodes to visit when a selected zone or node cannot satisfy the allocation request. That's all. More about NUMA and multiprocessor systems will be in the special part.
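The idea of an ordered zonelist can be shown with a small sketch. It is a simplification under assumptions: it handles one node only, it assumes that a request falls back from the highest allowed zone to the lower ones, and zone_kind and build_fallback are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

enum zone_kind { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE, Z_COUNT };

/*
 * Hypothetical helper: write the zones to try, in order, for a request whose
 * highest allowed zone is `highest`. Zones without memory are skipped.
 * Returns the number of entries written to `out`.
 */
static size_t build_fallback(enum zone_kind highest,
                             const int populated[Z_COUNT],
                             enum zone_kind out[Z_COUNT])
{
    size_t n = 0;

    assert(highest < Z_COUNT);
    for (int z = (int)highest; z >= 0; z--) {
        if (populated[z])
            out[n++] = (enum zone_kind)z;
    }
    return n;
}
```

For a machine with DMA, DMA32 and Normal zones, a request that may use Normal memory would try Normal first, then DMA32, then DMA.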
The rest of the stuff before scheduler initialization
Before we start to dive into the Linux kernel scheduler initialization process we must do a couple of things. The first thing is the page_alloc_init function from mm/page_alloc.c. This function looks pretty easy:
void __init page_alloc_init(void)
{
    hotcpu_notifier(page_alloc_cpu_notify, 0);
}
and initializes a handler for CPU hotplug. Of course hotcpu_notifier depends on the CONFIG_HOTPLUG_CPU configuration option, and if this option is set, it just calls the cpu_notifier macro, which expands to the call of register_cpu_notifier, which adds the hotplug cpu handler (page_alloc_cpu_notify in our case).
After this we can see the kernel command line in the initialization output:
And a couple of functions such as parse_early_param and parse_args which handle the Linux kernel command line. You may remember that we already saw the call of the parse_early_param function in the sixth part of the kernel initialization chapter, so why do we call it again? The answer is simple: we call this function in the architecture-specific code (x86_64 in our case), but not every architecture calls it. And we need the call of the second function, parse_args, to parse and handle non-early command line arguments.
In the next step we can see the call of jump_label_init from kernel/jump_label.c, which initializes jump labels.
After this we can see the call of the setup_log_buf function which sets up the printk log buffer. We already saw this function in the seventh part of the Linux kernel initialization process chapter.
PID hash initialization
The next is the pidhash_init function. As you know, each process is assigned a unique number which is called the process identification number, or PID. Each process generated with fork or clone is automatically assigned a new unique PID value by the kernel. The management of PIDs is centered around two special data structures: struct pid and struct upid. The first structure represents information about a PID in the kernel. The second structure represents the information that is visible in a specific namespace. All PID instances are stored in a special hash table:
static struct hlist_head *pid_hash;
This hash table is used to find the pid instance that belongs to a numeric PID value. So, pidhash_init initializes this hash. At the start of the pidhash_init function we can see the call of alloc_large_system_hash:
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                   HASH_EARLY | HASH_SMALL,
                                   &pidhash_shift, NULL,
                                   0, 4096);
The number of elements of pid_hash depends on the RAM configuration, but it can be between 2^4 and 2^12. The pidhash_init computes the size and allocates the required storage (an hlist in our case: the same as a doubly linked list, but the list head, struct hlist_head, contains one pointer instead of two). The alloc_large_system_hash function allocates a large system hash table with memblock_virt_alloc_nopanic if we pass the HASH_EARLY flag (as in our case) or with __vmalloc if we did not pass this flag.
The result we can see in the dmesg output:
$ dmesg | grep hash
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...
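To illustrate how such a table is used for lookups by numeric PID, here is a minimal sketch of a chained hash table with 4096 buckets, the same number as in the dmesg line above. The structure pid_node, the function names and the multiplicative hash constant are assumptions made for illustration and are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PID_HASH_BITS 12u
#define PID_HASH_SIZE (1u << PID_HASH_BITS)  /* 4096 buckets */

/* Hypothetical node: one entry per PID, chained inside its bucket. */
struct pid_node {
    int nr;
    struct pid_node *next;
};

static struct pid_node *pid_table[PID_HASH_SIZE];

/* Multiplicative hash; the constant is an illustrative choice. */
static uint32_t pid_bucket(int nr)
{
    return ((uint32_t)nr * 0x9E3779B1u) >> (32u - PID_HASH_BITS);
}

/* Put `node` at the head of the chain of its bucket. */
static void pid_insert(struct pid_node *node)
{
    uint32_t b;

    assert(node != NULL);
    b = pid_bucket(node->nr);
    node->next = pid_table[b];
    pid_table[b] = node;
}

/* Walk the chain of one bucket and return the entry for `nr`, or NULL. */
static struct pid_node *pid_find(int nr)
{
    struct pid_node *n;

    for (n = pid_table[pid_bucket(nr)]; n != NULL; n = n->next) {
        if (n->nr == nr)
            return n;
    }
    return NULL;
}
```

A lookup only walks the chain of one bucket, so its cost depends on the chain length and not on the total number of PIDs.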
That's all. The rest of the stuff before scheduler initialization is the following functions: vfs_caches_init_early does early initialization of the virtual file system (more about it will be in the chapter which describes the virtual file system), sort_main_extable sorts the kernel's built-in exception table entries which are between __start___ex_table and __stop___ex_table, and trap_init initializes trap handlers (we will learn more about the last two functions in the separate chapter about interrupts).
The last step before the scheduler initialization is the initialization of the memory manager with the mm_init function from init/main.c. As we can see, the mm_init function initializes different parts of the Linux kernel memory manager:
    page_ext_init_flatmem();
    mem_init();
    kmem_cache_init();
    percpu_init_late();
    pgtable_init();
    vmalloc_init();
The first, page_ext_init_flatmem, depends on the CONFIG_SPARSEMEM kernel configuration option and initializes extended data per page handling. The mem_init releases all bootmem, the kmem_cache_init initializes the kernel cache, the percpu_init_late replaces percpu chunks with those allocated by slub, the pgtable_init initializes the page table caches, and the vmalloc_init initializes vmalloc. Please NOTE that we will not dive into details about all of these functions and concepts, but we will see all of them in the Linux kernel memory manager chapter.
That's all. Now we can look at the scheduler.
Scheduler initialization
And now we come to the main purpose of this part: initialization of the task scheduler. I want to say again, as I already did many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about this. Ok, the next point is the sched_init function from kernel/sched/core.c, and as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the sched_init function we can see the following code:
#ifdef CONFIG_FAIR_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
First of all we can see two configuration options here:
CONFIG_FAIR_GROUP_SCHED
CONFIG_RT_GROUP_SCHED
Both of these options provide two different scheduling models. As we can read from the documentation, the current scheduler, CFS or Completely Fair Scheduler, uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive 1/n of the processor time, where n is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run, and they are called the scheduling policy. The Completely Fair Scheduler supports the following normal or non-real-time scheduling policies: SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. SCHED_NORMAL is used for most normal applications, where the amount of cpu each process consumes is mostly determined by the nice value; SCHED_BATCH is used for 100% non-interactive tasks; and SCHED_IDLE runs tasks only when the processor has nothing else to run. The real-time policies are also supported for time-critical applications: SCHED_FIFO and SCHED_RR. If you've read something about the Linux kernel scheduler, you may know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called scheduler classes. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.
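The "1/n of the processor time" model can be made concrete with a little arithmetic. The sketch below is an illustration of the idea and not scheduler code: each runnable task gets a share proportional to its weight, and with equal weights this reduces to 1/n. The helper name cpu_share_permille is hypothetical; 1024 is the weight that corresponds to nice 0.

```c
#include <assert.h>
#include <stddef.h>

#define NICE_0_WEIGHT 1024u  /* weight of a task with nice value 0 */

/*
 * Hypothetical helper: share of the CPU, in parts per thousand, that task
 * `idx` would get on an ideal processor among `n` runnable tasks.
 */
static unsigned int cpu_share_permille(const unsigned int *weights, size_t n, size_t idx)
{
    unsigned long total = 0;
    size_t i;

    assert(weights != NULL && n > 0 && idx < n);
    for (i = 0; i < n; i++)
        total += weights[i];
    assert(total > 0);
    return (unsigned int)((1000ul * weights[idx]) / total);
}
```

Four tasks with weight NICE_0_WEIGHT get 250 parts per thousand each; if one of three tasks has twice the weight of the others, it gets 500 and the other two get 250 each.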
Now let's get back to our code and look at the two configuration options CONFIG_FAIR_GROUP_SCHED and CONFIG_RT_GROUP_SCHED. The scheduler operates on an individual task. These options allow it to schedule groups of tasks (you can read more about it in the CFS group scheduling documentation). We can see that alloc_size, the size to allocate for the sched_entity and cfs_rq pointer arrays, grows by 2 * nr_cpu_ids * sizeof(void **) for each enabled option, so it depends on the number of processors. The memory is then allocated with kzalloc:
    ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
#ifdef CONFIG_FAIR_GROUP_SCHED
    root_task_group.se = (struct sched_entity **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);
    root_task_group.cfs_rq = (struct cfs_rq **)ptr;
    ptr += nr_cpu_ids * sizeof(void **);
#endif
The sched_entity is a structure which is defined in include/linux/sched.h and used by the scheduler to keep track of process accounting. The cfs_rq represents a run queue. So, you can see that we allocated space of size alloc_size for the run queue and scheduler entity of the root_task_group. The root_task_group is an instance of the task_group structure from kernel/sched/sched.h which contains task group related information:
struct task_group {
    ...
    ...
    struct sched_entity **se;
    struct cfs_rq **cfs_rq;
    ...
    ...
};
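The pattern used above, one zeroed allocation that is carved into two per-CPU pointer arrays, can be reproduced in userspace. This is a sketch only: it uses calloc in place of kzalloc, and group_arrays with its two functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two pointer arrays of a task group. */
struct group_arrays {
    void **se;      /* one pointer per CPU */
    void **cfs_rq;  /* one pointer per CPU */
    void *base;     /* start of the single allocation, kept for free() */
};

/* Allocate one zeroed block and split it into two arrays of nr_cpus pointers. */
static int group_arrays_alloc(struct group_arrays *g, size_t nr_cpus)
{
    size_t alloc_size;
    char *ptr;

    assert(g != NULL && nr_cpus > 0);
    alloc_size = 2 * nr_cpus * sizeof(void *);
    ptr = calloc(1, alloc_size);
    if (ptr == NULL)
        return -1;

    g->base = ptr;
    g->se = (void **)ptr;
    ptr += nr_cpus * sizeof(void *);  /* step over the first array */
    g->cfs_rq = (void **)ptr;
    return 0;
}

/* Release the single block and clear the pointers into it. */
static void group_arrays_free(struct group_arrays *g)
{
    assert(g != NULL);
    free(g->base);
    g->base = NULL;
    g->se = NULL;
    g->cfs_rq = NULL;
}
```

One allocation instead of two keeps the arrays adjacent in memory and leaves a single pointer to release.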
The root task group is the task group to which every task in the system belongs. As we allocated space for the root task group scheduler entity and run queue, we go over all possible CPUs (the cpu_possible_mask bitmap) and allocate zeroed memory from a particular memory node with the kzalloc_node function for the load_balance_mask percpu variable:
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
Here cpumask_var_t is the cpumask_t with one difference: cpumask_var_t is allocated with only nr_cpu_ids bits, while cpumask_t always has NR_CPUS bits (you can read more about cpumask in the CPU masks part). As you can see:
#ifdef CONFIG_CPUMASK_OFFSTACK
    for_each_possible_cpu(i) {
        per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
                cpumask_size(), GFP_KERNEL, cpu_to_node(i));
    }
#endif
this code depends on the CONFIG_CPUMASK_OFFSTACK configuration option. This configuration option says to use dynamic allocation for cpumask, instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions:
    init_rt_bandwidth(&def_rt_bandwidth,
                      global_rt_period(), global_rt_runtime());
    init_dl_bandwidth(&def_dl_bandwidth,
                      global_rt_period(), global_rt_runtime());
we initialize bandwidth management for the real-time and SCHED_DEADLINE tasks. These functions initialize the rt_bandwidth and dl_bandwidth structures which store information about the maximum bandwidth of the system. For example, let's look at the implementation of the init_rt_bandwidth function:
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
    rt_b->rt_period = ns_to_ktime(period);
    rt_b->rt_runtime = runtime;
    raw_spin_lock_init(&rt_b->rt_runtime_lock);
    hrtimer_init(&rt_b->rt_period_timer,
                 CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    rt_b->rt_period_timer.function = sched_rt_period_timer;
}
It takes three parameters:
address of the rt_bandwidth structure which contains information about the allocated and consumed quota within a period;
period - period over which real-time task bandwidth enforcement is measured;
runtime - part of the period that we allow tasks to run.
As period and runtime we pass the results of the global_rt_period and global_rt_runtime functions, which are 1s and 0.95s by default. The rt_bandwidth structure is defined in kernel/sched/sched.h and looks like:
struct rt_bandwidth {
    raw_spinlock_t rt_runtime_lock;
    ktime_t        rt_period;
    u64            rt_runtime;
    struct hrtimer rt_period_timer;
};
As you can see, it contains runtime and period and also the two following fields:
rt_runtime_lock - spinlock for the rt_time protection;
rt_period_timer - high-resolution kernel timer for unthrottling of real-time tasks.
So, in init_rt_bandwidth we initialize the rt_bandwidth period and runtime with the given parameters, and initialize the spinlock and the high-resolution timer. In the next step, depending on whether SMP is enabled, we initialize the root domain:
#ifdef CONFIG_SMP
    init_defrootdomain();
#endif
The real-time scheduler requires global resources to make scheduling decisions. But unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of root domains was introduced to improve scalability. The Linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks, and it is called cpuset. If a cpuset contains CPUs that do not overlap with other cpusets, it is an exclusive cpuset. Each exclusive cpuset defines an isolated domain, or root domain, of CPUs partitioned from other cpusets or CPUs. A root domain is represented by the struct root_domain from kernel/sched/sched.h in the Linux kernel, and its main purpose is to narrow the scope of the global variables to per-domain variables, so that all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details in the chapter about the real-time scheduler.
After the root domain initialization, we initialize the bandwidth for the real-time tasks of the root task group as we did above:
#ifdef CONFIG_RT_GROUP_SCHED
    init_rt_bandwidth(&root_task_group.rt_bandwidth,
                      global_rt_period(), global_rt_runtime());
#endif
In the next step, depending on the CONFIG_CGROUP_SCHED kernel configuration option, we initialize the siblings and children lists of the root task group. As we can read from the documentation, CONFIG_CGROUP_SCHED is:
This option allows you to create arbitrary task groups using the "cgroup" pseudo
filesystem and control the cpu bandwidth allocated to each such task group.
As we finished with the lists initialization, we can see the call of the autogroup_init function:
#ifdef CONFIG_CGROUP_SCHED
    list_add(&root_task_group.list, &task_groups);
    INIT_LIST_HEAD(&root_task_group.children);
    INIT_LIST_HEAD(&root_task_group.siblings);
    autogroup_init(&init_task);
#endif
which initializes automatic process group scheduling.
After this we go through all possible cpus (you may remember that possible CPUs are stored in the cpu_possible_mask bitmap of CPUs that can ever be available in the system) and initialize a runqueue for each possible cpu:
    for_each_possible_cpu(i) {
        struct rq *rq;
        ...
        ...
        ...
Each processor has its own locking and individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure whose name is runqueue. As there is no global lock and no global runqueue, we go through all possible CPUs and initialize a runqueue for every cpu. The runqueue is represented by the rq structure in the Linux kernel, which is defined in kernel/sched/sched.h.
        rq = cpu_rq(i);
        raw_spin_lock_init(&rq->lock);
        rq->nr_running = 0;
        rq->calc_load_active = 0;
        rq->calc_load_update = jiffies + LOAD_FREQ;
        init_cfs_rq(&rq->cfs);
        init_rt_rq(&rq->rt);
        init_dl_rq(&rq->dl);
        rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
Here we get the runqueue for every CPU with the cpu_rq macro, which returns the runqueues percpu variable, and start to initialize it: the runqueue lock, the number of running tasks, the calc_load related fields (calc_load_active and calc_load_update) which are used in the reckoning of the CPU load, and the completely fair, real-time and deadline related fields in the runqueue. After this we initialize the cpu_load array with zeros and set the last load update tick to the jiffies variable, which holds the number of timer ticks since the system boot:
        for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
            rq->cpu_load[j] = 0;
        rq->last_load_update_tick = jiffies;
where cpu_load keeps a history of runqueue loads in the past; for now CPU_LOAD_IDX_MAX is 5. In the next step we fill the runqueue fields which are related to SMP, but we will not cover them in this part. And at the end of the loop we initialize the high-resolution timer for the given runqueue and set the iowait number (more about it in the separate part about the scheduler):
        init_rq_hrtick(rq);
        atomic_set(&rq->nr_iowait, 0);
Now we come out of the for_each_possible_cpu loop, and next we need to set the load weight for the init task with the set_load_weight function. The weight of a process is calculated through its dynamic priority, which is the static priority + the scheduling class of the process. After this we increase the memory usage counter of the memory descriptor of the init process and set the scheduler class for the current process:
    atomic_inc(&init_mm.mm_count);
    current->sched_class = &fair_sched_class;
And we make the current process (it will be the first init process) idle and update the value of calc_load_update with the 5 seconds interval:
    init_idle(current, smp_processor_id());
    calc_load_update = jiffies + LOAD_FREQ;
So, the init process will be run when there are no other candidates (as it is the first process in the system). In the end we just set the scheduler_running variable:
    scheduler_running = 1;
That's all. The Linux kernel scheduler is initialized. Of course, we missed many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu, etc.) work in the Linux kernel, but we took a short look at the scheduler initialization process. All other details we will look at in the separate part which will be fully dedicated to the scheduler.
Conclusion
It is the end of the eighth part about the Linux kernel initialization process. In this part, we looked at the initialization process of the scheduler, and in the next part we will continue to dive into the Linux kernel initialization process and will see the initialization of the RCU and other initialization stuff.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
CPU masks
high-resolution kernel timer
spinlock
Run queue
Linux kernel memory manager
slub
virtual file system
Linux kernel hotplug documentation
IRQ
Global Descriptor Table
Per-CPU variables
SMP
RCU
CFS Scheduler documentation
Real-Time group scheduling
Previous part
Kernel initialization. Part 9.
RCU initialization
This is the ninth part of the Linux kernel initialization process, and in the previous part we stopped at the scheduler initialization. In this part we will continue to dive into the Linux kernel initialization process, and the main purpose of this part will be to learn about the initialization of the RCU. We can see that the next step in init/main.c after sched_init is the call of preempt_disable. There are two macros:
preempt_disable
preempt_enable
for disabling and enabling preemption. First of all let's try to understand what preemption is in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to preempt the current task in order to run a task with higher priority. Here we need to disable preemption because we will have only one init process during early boot, and we do not need to stop it before we call the cpu_idle function. The preempt_disable macro is defined in include/linux/preempt.h and depends on the CONFIG_PREEMPT_COUNT kernel configuration option. This macro is implemented as:
#define preempt_disable() \
do { \
    preempt_count_inc(); \
    barrier(); \
} while (0)
and if CONFIG_PREEMPT_COUNT is not set, just:
#define preempt_disable() barrier()
Let's look at it. First of all we can see one difference between these macro implementations. The preempt_disable with CONFIG_PREEMPT_COUNT contains the call of preempt_count_inc. There is a special percpu variable which stores the number of held locks and preempt_disable calls:
DECLARE_PER_CPU(int, __preempt_count);
In the first implementation of preempt_disable we increment this __preempt_count. There is an API for returning the value of __preempt_count: the preempt_count function. As we called preempt_disable, first of all we increment the preemption counter with the preempt_count_inc macro, which expands to:
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val) __preempt_count_add(val)
where preempt_count_add calls the raw_cpu_add_4 macro which adds 1 to the given percpu variable (__preempt_count in our case; you can read more about percpu variables in the part about Per-CPU variables). Ok, we increased __preempt_count, and in the next step we can see the call of the barrier macro in both macros. The barrier macro inserts an optimization barrier. Both the compiler and the processor may reorder independent memory access operations, so we need a way to tell them where the order must be preserved. This mechanism is a memory barrier; the barrier macro itself is a compiler barrier, which tells the compiler not to move memory accesses across it. Let's consider a simple example:
preempt_disable();
foo();
preempt_enable();
The compiler can rearrange it as:
preempt_disable();
preempt_enable();
foo();
In this case the non-preemptible function foo can be preempted. As we put the barrier macro in the preempt_disable and preempt_enable macros, it prevents the compiler from swapping preempt_count_inc with other statements. More about barriers you can read here and here.
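The counter plus barrier pattern can be tried out in userspace. The sketch below is an assumption-laden imitation and not the kernel macros: a plain global stands in for the percpu __preempt_count, the my_ prefixed names are hypothetical, and the empty asm statement with a "memory" clobber is the usual GCC/Clang way to write a compiler barrier.

```c
#include <assert.h>
#include <stdbool.h>

/* Compiler barrier: the compiler may not move memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static int preempt_counter;  /* stand-in for the percpu __preempt_count */

/* Increment first, then the barrier: the protected code stays after it. */
static void my_preempt_disable(void)
{
    preempt_counter++;
    compiler_barrier();
}

/* Barrier first, then decrement: the protected code stays before it. */
static void my_preempt_enable(void)
{
    compiler_barrier();
    assert(preempt_counter > 0);
    preempt_counter--;
}

/* Preemption is allowed only when every disable was matched by an enable. */
static bool my_preemptible(void)
{
    return preempt_counter == 0;
}
```

Because the counter is incremented and decremented, the calls nest: after two calls of my_preempt_disable, preemption stays disabled until my_preempt_enable has been called twice.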
In the next step we can see the following statement:
    if (WARN(!irqs_disabled(),
             "Interrupts were enabled *very* early, fixing it\n"))
        local_irq_disable();
which checks the state of IRQs and disables them (with the cli instruction on x86_64) if they are enabled.
That's all. Preemption is disabled and we can go ahead.
Initialization of the integer ID management
In the next step we can see the call of the idr_init_cache function, which is defined in lib/idr.c. The idr library is used in various places in the Linux kernel to manage assigning integer IDs to objects and looking up objects by id.
Let's look at the implementation of the idr_init_cache function:
void __init idr_init_cache(void)
{
    idr_layer_cache = kmem_cache_create("idr_layer_cache",
                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
Here we can see the call of kmem_cache_create. We already called kmem_cache_init in init/main.c. The kmem_cache_create function creates a new cache of objects of the given size, and objects are later taken from it with kmem_cache_alloc (we will see more about caches in the Linux kernel memory management chapter). In our case kmem_cache_create creates a cache in the slab allocator. As you can see, we pass five parameters to kmem_cache_create:
name of the cache;
size of the object to store in the cache;
required alignment of the objects;
flags;
constructor for the objects.
and it will create a kmem_cache for the integer IDs. Integer IDs are a commonly used pattern to map a set of integer IDs to a set of pointers. We can see the usage of integer IDs, for example, in the i2c drivers subsystem. For example drivers/i2c/i2c-core.c, which is the core of the i2c subsystem, defines an ID for the i2c adapter with the DEFINE_IDR macro:
static DEFINE_IDR(i2c_adapter_idr);
and then it uses it for the declaration of the i2c adapter:
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
    int id;
    ...
    ...
    ...
    id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
    ...
    ...
    ...
}
and the i2c_adapter_idr maps each bus number to its i2c adapter.
More about integer ID management you can read here.
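The idea behind such an ID allocator, a map from small integers to pointers with allocation inside a requested range, can be sketched with a fixed array. This is not how the idr library is implemented; id_map and its functions are hypothetical, and only the calling convention (lowest free id in [start, end), with end <= 0 meaning "no upper bound") is modelled on the idr_alloc call shown above.

```c
#include <assert.h>
#include <stddef.h>

#define ID_MAP_SIZE 64

/* Hypothetical fixed-size map from integer ids to object pointers. */
struct id_map {
    void *slots[ID_MAP_SIZE];
};

/* Bind `ptr` to the lowest free id in [start, end); returns the id or -1. */
static int id_map_alloc(struct id_map *m, void *ptr, int start, int end)
{
    int id;

    assert(m != NULL && ptr != NULL && start >= 0);
    if (end <= 0 || end > ID_MAP_SIZE)
        end = ID_MAP_SIZE;  /* no upper bound requested, or clamp to capacity */
    for (id = start; id < end; id++) {
        if (m->slots[id] == NULL) {
            m->slots[id] = ptr;
            return id;
        }
    }
    return -1;  /* no free id in the requested range */
}

/* Look up the object bound to `id`, or NULL. */
static void *id_map_find(const struct id_map *m, int id)
{
    assert(m != NULL);
    return (id >= 0 && id < ID_MAP_SIZE) ? m->slots[id] : NULL;
}

/* Release `id` so that it can be allocated again. */
static void id_map_remove(struct id_map *m, int id)
{
    assert(m != NULL && id >= 0 && id < ID_MAP_SIZE);
    m->slots[id] = NULL;
}
```

Asking for the range [nr, nr + 1), as the i2c code does, requests exactly the id nr and fails if that id is already taken.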
RCU initialization
The next step is RCU initialization with the rcu_init function, and its implementation depends on two kernel configuration options:
CONFIG_TINY_RCU
CONFIG_TREE_RCU
In the first case rcu_init will be in kernel/rcu/tiny.c, and in the second case it will be defined in kernel/rcu/tree.c. We will see the implementation of the tree rcu, but first of all about the RCU in general.
RCU or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. In the early stage the Linux kernel provided support and an environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock. Nowadays the Linux kernel has no single global lock, but provides different mechanisms including lock-free data structures, percpu data structures and others. One of these mechanisms is read-copy update. The RCU technique is designed for rarely modified data structures. The idea of RCU is simple. For example, we have a rarely modified data structure. If somebody wants to change this data structure, we make a copy of it and make all changes in the copy. At the same time all other users of the data structure use the old version of it. Next, we need to choose a safe moment when the original version of the data structure has no users, and update it with the modified copy.
Of course this description of RCU is very simplified. To understand some details about RCU, first of all we need to learn some terminology. Data readers in RCU execute in a critical section. Every time a data reader enters the critical section, it calls rcu_read_lock, and it calls rcu_read_unlock on exit from the critical section. If a thread is not in the critical section, it is in a state which is called the quiescent state. A period of time during which every thread has been in a quiescent state at least once is called a grace period. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is removal: it atomically removes the element from the data structure, but does not release the physical memory. After this the writer thread announces a grace period and waits until it is finished. During this time, the removed element is still available to the reader threads that already hold a reference to it. After the grace period is finished, the second step of the element removal starts: it just frees the memory of the element.
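The copy, modify and publish steps can be sketched with C11 atomics. This is a toy illustration under strong assumptions and not the kernel RCU API: there is one updater, grace period detection is not implemented at all (the caller simply must not free the old version while a reader may still use it), and config, config_read and config_update are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {
    int value;
};

/* Pointer to the current version; readers only ever load it. */
static _Atomic(struct config *) current_config;

/* Reader side: take a snapshot of the current version. */
static const struct config *config_read(void)
{
    return atomic_load_explicit(&current_config, memory_order_acquire);
}

/*
 * Updater side: copy the current version, change the copy, publish it.
 * Returns the old version, which may be freed only after all readers
 * that could still hold it are done (the grace period).
 */
static struct config *config_update(int new_value)
{
    struct config *old = atomic_load_explicit(&current_config, memory_order_relaxed);
    struct config *copy = malloc(sizeof(*copy));

    assert(copy != NULL);
    if (old != NULL)
        *copy = *old;         /* start from the existing contents */
    copy->value = new_value;  /* modify only the private copy */
    atomic_store_explicit(&current_config, copy, memory_order_release);
    return old;
}
```

A reader that took its snapshot before the update keeps seeing the old value, while new readers see the new one; this is the property that lets readers run without locks.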
There are a couple of implementations of RCU. The old RCU is called classic, and the new implementation is called tree RCU. As you may already understand, the CONFIG_TREE_RCU kernel configuration option enables tree RCU. Another one is the tiny RCU, which depends on CONFIG_TINY_RCU and CONFIG_SMP=n. We will see more details about RCU in general in the separate chapter about synchronization primitives, but now let's look at the rcu_init implementation from kernel/rcu/tree.c:
void __init rcu_init(void)
{
    int cpu;
    rcu_bootup_announce();
    rcu_init_geometry();
    rcu_init_one(&rcu_bh_state, &rcu_bh_data);
    rcu_init_one(&rcu_sched_state, &rcu_sched_data);
    __rcu_init_preempt();
    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
    /*
     * We don't need protection against CPU-hotplug here because
     * this is called early in boot, before either interrupts
     * or the scheduler are operational.
     */
    cpu_notifier(rcu_cpu_notify, 0);
    pm_notifier(rcu_pm_notify, 0);
    for_each_online_cpu(cpu)
        rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
    rcu_early_boot_tests();
}
At the beginning of the rcu_init function we define the cpu variable and call rcu_bootup_announce. The rcu_bootup_announce function is pretty simple:
static void __init rcu_bootup_announce(void)
{
    pr_info("Hierarchical RCU implementation.\n");
    rcu_bootup_announce_oddness();
}
It just prints information about the RCU with the pr_info function and calls rcu_bootup_announce_oddness, which uses pr_info too, for printing different information about the current RCU configuration, which depends on different kernel configuration options like CONFIG_RCU_TRACE, CONFIG_PROVE_RCU, CONFIG_RCU_FANOUT_EXACT, etc. In the next step, we can see the call of the rcu_init_geometry function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. Actually RCU provides scalability with extremely low contention on its internal locks. What if a data structure is read from different CPUs? The RCU API provides the rcu_state structure which represents the RCU global state, including the node hierarchy. The hierarchy is represented by the:
struct rcu_node node[NUM_RCU_NODES];
array of structures. As we can read in the comment above the definition of this structure:
The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.
Small systems will have a "hierarchy" consisting of a single rcu_node.
The rcu_node structure is defined in kernel/rcu/tree.h and contains information about the current grace period, whether the grace period is completed or not, the CPUs or groups that need to switch in order for the current grace period to proceed, etc. Every rcu_node contains a lock for a couple of CPUs. These rcu_node structures are embedded into a linear array in the rcu_state structure and represented as a tree with the root in the zero element, and it covers all CPUs. As you can see, the number of rcu nodes is determined by NUM_RCU_NODES, which depends on the number of available CPUs:
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
where the levels values depend on the CONFIG_RCU_FANOUT_LEAF configuration option. For example, in the simplest case, one rcu_node will cover two CPUs on a machine with eight CPUs:
+-----------------------------------------------------------------+
|  rcu_state                                                      |
|                  +----------------------+                       |
|                  |         root         |                       |
|                  |       rcu_node       |                       |
|                  +----------------------+                       |
|                     |                 |                         |
|                +----v-----+        +--v-------+                 |
|                |          |        |          |                 |
|                | rcu_node |        | rcu_node |                 |
|                |          |        |          |                 |
|       +-------------------+        +----------------+           |
|       |                   |        |                |           |
|       |                   |        |                |           |
|  +----v-----+     +-------v--+   +-v--------+     +-v--------+  |
|  |          |     |          |   |          |     |          |  |
|  | rcu_node |     | rcu_node |   | rcu_node |     | rcu_node |  |
|  |          |     |          |   |          |     |          |  |
|  +----------+     +----------+   +----------+     +----------+  |
|       |                |               |               |        |
|       |                |               |               |        |
|       |                |               |               |        |
|       |                |               |               |        |
+-------|----------------|---------------|---------------|--------+
        |                |               |               |
+-------v----------------v---------------v---------------v--------+
|               |                |               |                |
|     CPU1      |      CPU3      |     CPU5      |      CPU7      |
|               |                |               |                |
|     CPU2      |      CPU4      |     CPU6      |      CPU8      |
|               |                |               |                |
+-----------------------------------------------------------------+
So, in the rcu_init_geometry function we just need to calculate the total number of rcu_node structures. We start with the calculation of the jiffies until the first and the next fqs, which is force-quiescent-state (read about it above):
    d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
    if (jiffies_till_first_fqs == ULONG_MAX)
        jiffies_till_first_fqs = d;
    if (jiffies_till_next_fqs == ULONG_MAX)
        jiffies_till_next_fqs = d;
where:
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV 256
As we calculated these jiffies, we check whether the previously defined jiffies_till_first_fqs and jiffies_till_next_fqs variables are equal to ULONG_MAX (their default values) and set them to the calculated value. As we did not touch these variables before, they are equal to ULONG_MAX:
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;
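A quick way to see what the default delay evaluates to is to redo the arithmetic of the two macros above for a few values of HZ and CPU counts. The sketch below does only that; the function names are hypothetical.

```c
#include <assert.h>

#define FQS_DIV 256u  /* same value as RCU_JIFFIES_FQS_DIV above */

/* Same expression as RCU_JIFFIES_TILL_FORCE_QS, with HZ as a parameter. */
static unsigned long jiffies_till_force_qs(unsigned int hz)
{
    return 1ul + (hz > 250) + (hz > 500);
}

/* Default value of d for a given timer frequency and number of CPUs. */
static unsigned long default_fqs_delay(unsigned int hz, unsigned int nr_cpus)
{
    assert(hz > 0 && nr_cpus > 0);
    return jiffies_till_force_qs(hz) + nr_cpus / FQS_DIV;
}
```

For HZ=1000 and 8 CPUs the result is 3 jiffies, for HZ=250 it is 1 jiffy, and the CPU count only starts to matter at 256 CPUs and above because of the integer division.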
In the next step of rcu_init_geometry, we check whether rcu_fanout_leaf did not change (it still has the value of the CONFIG_RCU_FANOUT_LEAF configuration option from compile time) and whether nr_cpu_ids is equal to NR_CPUS; if both are true, we just return:
    if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
        nr_cpu_ids == NR_CPUS)
        return;
After this we need to compute the number of nodes that can be handled by an rcu_node tree with the given number of levels:
    rcu_capacity[0] = 1;
    rcu_capacity[1] = rcu_fanout_leaf;
    for (i = 2; i <= MAX_RCU_LVLS; i++)
        rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
And in the last step we calculate the number of rcu_nodes at each level of the tree in a loop.
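The capacity computation above answers the question "how many levels does the tree need for this many CPUs?". Here is a sketch of that calculation for the eight-CPU example from the diagram, with a leaf fanout of 2 and an inner fanout of 2. The function names are hypothetical, and the limit of four levels is an assumption made for this illustration.

```c
#include <assert.h>

#define MAX_LVLS 4  /* assumed maximum tree depth for this sketch */

/* Number of tree levels needed for nr_cpus, or 0 if they do not fit. */
static int rcu_tree_levels(unsigned int nr_cpus, unsigned int leaf_fanout,
                           unsigned int fanout)
{
    unsigned long capacity = leaf_fanout;  /* CPUs covered by a one-level tree */
    int lvl;

    assert(nr_cpus > 0 && leaf_fanout > 0 && fanout > 1);
    for (lvl = 1; lvl <= MAX_LVLS; lvl++) {
        if (nr_cpus <= capacity)
            return lvl;
        capacity *= fanout;  /* one more level multiplies the capacity */
    }
    return 0;
}

/* Number of nodes on a level, counted from the leaves (0 is the leaf level). */
static unsigned int rcu_nodes_at(unsigned int nr_cpus, unsigned int leaf_fanout,
                                 unsigned int fanout, int from_leaf)
{
    unsigned int n = (nr_cpus + leaf_fanout - 1) / leaf_fanout;

    while (from_leaf-- > 0)
        n = (n + fanout - 1) / fanout;  /* each parent covers `fanout` children */
    return n;
}
```

For eight CPUs this gives three levels with 4, 2 and 1 nodes, which matches the picture: four leaf rcu_node structures, two in the middle and one root.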
Having calculated the geometry of the rcu_node tree, we go back to the rcu_init function. The next step is to initialize two rcu_state structures with the rcu_init_one function:
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
The rcu_init_one function takes two arguments:
Global RCU state;
Per-CPU data for RCU.
Both variables are declared in kernel/rcu/tree.h, each state together with its per-CPU data:
extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
About these states you can read here. As I wrote above, we need to initialize the rcu_state structures, and the rcu_init_one function will help us with it. After the rcu_state initialization, we can see the call of __rcu_init_preempt, which depends on the CONFIG_PREEMPT_RCU kernel configuration option. It does the same as the previous calls: it initializes the rcu_preempt_state structure, which has the rcu_state type, with the rcu_init_one function. After this, in rcu_init, we can see the call of the:
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
function. This function registers a handler of the pending interrupt. A pending interrupt, or softirq, means that part of the work can be delayed for later execution, when the system is less loaded. Pending interrupts are represented by the following structure:
struct softirq_action
{
    void (*action)(struct softirq_action *);
};
which is defined in include/linux/interrupt.h and contains only one field, the handler of an interrupt. You can see the softirqs in your system with:
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
HI:20010200
TIMER:1377791081101395731076471074081149729965398665
NET_TX:11270401100
NET_RX:3342211329393076451361292303
BLOCK:525355968779201637442282855
BLOCK_IOPOLL:00000000
TASKLET:6602916113024267080
SCHED:10235075950917057535675323826276927969914
HRTIMER:510302368260219255248246
RCU:8129068062829796901568390693856330463473
The open_softirq function takes two parameters:
index of the interrupt;
interrupt handler.
and adds the interrupt handler to the array of the pending interrupts:
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}
In our case the interrupt handler is rcu_process_callbacks, which is defined in kernel/rcu/tree.c and does the RCU core processing for the current CPU. After we have registered the softirq for the RCU, we can see the following code:
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
    rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
Here we can see the registration of the CPU notifier, which is needed in systems that support CPU hotplug; we will not dive into details about this theme. The last function in rcu_init is rcu_early_boot_tests:
void rcu_early_boot_tests(void)
{
    pr_info("Running RCU self tests\n");
    if (rcu_self_test)
        early_boot_test_call_rcu();
    if (rcu_self_test_bh)
        early_boot_test_call_rcu_bh();
    if (rcu_self_test_sched)
        early_boot_test_call_rcu_sched();
}
which runs self tests for the RCU.
That's all. We saw the initialization process of the RCU subsystem. As I wrote above, more about the RCU will be in the separate chapter about synchronization primitives.
Rest of the initialization process
Ok, we have already passed the main theme of this part, which is RCU initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work at initialization time, but we will not dive into deep details about these functions, for different reasons. Some reasons not to dive into details are the following:
They are not very important for the generic kernel initialization process and can depend on the kernel configuration;
They have the character of debugging and are not important for now either;
We will see many of these things in separate parts/chapters.
After we initialized RCU, the next step which you can see in init/main.c is the trace_init function. As you can understand from its name, this function initializes the tracing subsystem. More about the linux kernel trace system you can read here.
After trace_init, we can see the call of radix_tree_init. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the Radix tree. This function is defined in lib/radix-tree.c, and you can read more about it in the part about the Radix tree.
In the next step we can see the functions which are related to the interrupts handling subsystem. They are:
early_irq_init
init_IRQ
softirq_init
We will see an explanation of these functions and their implementation in the special part about interrupts and exceptions handling. After this come many different functions (like init_timers, hrtimers_init, time_init, etc.) which are related to timekeeping and timers. We will see more about these functions in the chapter about timers.
The next couple of functions are related to perf events: perf_event_init (there will be a separate chapter about perf) and initialization of profiling with profile_init. After this we enable irqs with the call of:
local_irq_enable();
which expands to the sti instruction, and make the post-initialization of the SLAB with the call of the kmem_cache_init_late function (as I wrote above, we will learn about the SLAB in the Linux memory management chapter).
After the post-initialization of the SLAB, the next point is initialization of the console with the console_init function from drivers/tty/tty_io.c.
After the console initialization, we can see the lockdep_info function, which prints information about the Lock dependency validator. After this, we can see the initialization of the dynamic allocation of debug objects with debug_objects_mem_init, kernel memory leak detector initialization with kmemleak_init, per-CPU pageset setup with setup_per_cpu_pageset, setup of the NUMA policy with numa_policy_init, setting time for the scheduler with sched_clock_init, pidmap initialization with the call of the pidmap_init function for the initial PID namespace, cache creation with anon_vma_init for the private virtual memory areas, and early initialization of the ACPI with acpi_early_init.
This is the end of the ninth part of the linux kernel initialization process, and here we saw the initialization of the RCU. In the last paragraph of this part (Rest of the initialization process) we went through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these things, or if you know of them but do not understand them. As I have already written many times, we will see the details of the implementations, but in other parts or other chapters.
Conclusion
It is the end of the ninth part about the linux kernel initialization process. In this part, we looked at the initialization process of the RCU subsystem. In the next part we will continue to dive into the linux kernel initialization process, and I hope that we will finish with the start_kernel function, go to the rest_init function from the same init/main.c source code file and see the start of the first process.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
lock-free data structures
kmemleak
ACPI
IRQs
RCU
RCU documentation
integer ID management
Documentation/memory-barriers.txt
Runtime locking correctness validator
Per-CPU variables
Linux kernel memory management
slab
i2c
Previous part
Kernel initialization. Part 10.
End of the linux kernel initialization process
This is the tenth part of the chapter about the linux kernel initialization process. In the previous part we saw the initialization of the RCU and stopped at the call of the acpi_early_init function. This part will be the last part of the Kernel initialization process chapter, so let's finish it.
After the call of the acpi_early_init function from init/main.c, we can see the following code:
#ifdef CONFIG_X86_ESPFIX64
    init_espfix_bsp();
#endif
Here we can see the call of the init_espfix_bsp function, which depends on the CONFIG_X86_ESPFIX64 kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking of bits 31:16 of the esp register when returning to a 16-bit stack. First of all, we install the espfix page upper directory into the kernel page directory in init_espfix_bsp:
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
Where ESPFIX_BASE_ADDR is:
#define PGDIR_SHIFT 39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
We can also find it in Documentation/x86/x86_64/mm:
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
After we've filled the page global directory with the espfix pud, the next step is the call of the init_espfix_random and init_espfix_ap functions. The first function returns random locations for the espfix page and the second enables espfix for the current CPU. After init_espfix_bsp has finished its work, we can see the call of the thread_info_cache_init function, which is defined in kernel/fork.c and allocates a cache for thread_info if THREAD_SIZE is less than PAGE_SIZE:
# if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
    thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                          THREAD_SIZE, 0, NULL);
    BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif
As we already know, PAGE_SIZE is (_AC(1,UL) << PAGE_SHIFT), or 4096 bytes, and THREAD_SIZE is (PAGE_SIZE << THREAD_SIZE_ORDER), or 16384 bytes for x86_64. The next function after thread_info_cache_init is cred_init from kernel/cred.c. This function just allocates space for the credentials (like uid, gid, etc.):
void __init cred_init(void)
{
    cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
                                 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}
You can read more about credentials in Documentation/security/credentials.txt. The next step is the fork_init function from kernel/fork.c. The fork_init function allocates space for the task_struct. Let's look at the implementation of fork_init. First of all we can see the definition of the ARCH_MIN_TASKALIGN macro and the creation of a slab where task_structs will be allocated:
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif
    task_struct_cachep =
        kmem_cache_create("task_struct", sizeof(struct task_struct),
                          ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif
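As we can see, this code depends on the CONFIG_ARCH_TASK_STRUCT_ALLOCATOR kernel configuration option. This option indicates that the given architecture provides its own alloc_task_struct. Because of the #ifndef guard, the code above is compiled only when the option is not set. As far as I can tell, x86_64 has no alloc_task_struct function of its own and does not select this option, so on x86_64 the generic task_struct cache shown above is the one that gets created.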
After this we can see the call of the arch_task_cache_init function in fork_init:
void arch_task_cache_init(void)
{
    task_xstate_cachep =
        kmem_cache_create("task_xstate", xstate_size,
                          __alignof__(union thread_xstate),
                          SLAB_PANIC | SLAB_NOTRACK, NULL);
    setup_xstate_comp();
}
The arch_task_cache_init function does the initialization of the architecture-specific caches. In our case it is x86_64, so, as we can see, arch_task_cache_init allocates a cache for task_xstate, which represents the FPU state, and sets up the offsets and sizes of all extended states in the xsave area with the call of the setup_xstate_comp function. After arch_task_cache_init we calculate the default maximum number of threads with:
set_max_threads(MAX_THREADS);
where the default maximum number of threads is:
#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS FUTEX_TID_MASK
Allocating cache for init task
At the end of the fork_init function we initialize the resource limits of init_task through its signal descriptor:
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
    init_task.signal->rlim[RLIMIT_NPROC];
As we know, init_task is an instance of the task_struct structure, so it contains the signal field, which represents the signal descriptor. It has the type struct signal_struct. On the first two lines we can see the setting of the current and maximum values of a resource limit. Every process has an associated set of resource limits. These limits specify the amount of resources which the current process can use. Here rlim is the resource control limit, represented by the:
struct rlimit {
    __kernel_ulong_t rlim_cur;
    __kernel_ulong_t rlim_max;
};
structure from include/uapi/linux/resource.h. In our case the resources are RLIMIT_NPROC, which is the maximum number of processes that a user can own, and RLIMIT_SIGPENDING, the maximum number of pending signals. We can see them in:
cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
...
...
...
Max processes             63815                63815                processes
Max pending signals       63815                63815                signals
...
...
...
Initialization of the caches
The next function after fork_init is proc_caches_init from kernel/fork.c. This function allocates caches for the memory descriptors (the mm_struct structure). At the beginning of proc_caches_init we can see the allocation of the different SLAB caches with the call of kmem_cache_create:
sighand_cachep - manage information about installed signal handlers;
signal_cachep - manage information about process signal descriptor;
files_cachep - manage information about opened files;
fs_cachep - manage filesystem information.
After this we allocate a SLAB cache for the mm_struct structures:
mm_cachep = kmem_cache_create("mm_struct",
        sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
        SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
After this we allocate a SLAB cache for the important vm_area_struct, which is used by the kernel to manage virtual memory space:
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
Note that we use the KMEM_CACHE macro here instead of kmem_cache_create. This macro is defined in include/linux/slab.h and just expands to the kmem_cache_create call:
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
        sizeof(struct __struct), __alignof__(struct __struct),\
        (__flags), NULL)
KMEM_CACHE has one difference from a direct kmem_cache_create call. Take a look at the __alignof__ operator: the KMEM_CACHE macro aligns the objects of the SLAB to the alignment requirement of the given structure type, whereas kmem_cache_create uses whatever alignment value is passed to it. After this we can see the call of the mmap_init and nsproxy_cache_init functions. The first function initializes the virtual memory area SLAB and the second function initializes the SLAB for namespaces.
The next function after proc_caches_init is buffer_init. This function is defined in the fs/buffer.c source code file and allocates a cache for the buffer_head. The buffer_head is a special structure which is defined in include/linux/buffer_head.h and used for managing buffers. At the start of the buffer_init function we allocate a cache for the struct buffer_head structures with the call of the kmem_cache_create function, as we did in the previous functions, and calculate the maximum size of the buffers in memory with:
nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
which will be equal to 10% of the ZONE_NORMAL (all RAM from the 4GB on the x86_64). The next function after buffer_init is vfs_caches_init. This function allocates SLAB caches and hashtables for different VFS caches. We already saw the vfs_caches_init_early function in the eighth part of the linux kernel initialization process, which initialized caches for dcache (or directory-cache) and the inode cache. The vfs_caches_init function makes the post-early initialization of the dcache and inode caches, the private data cache, hash tables for the mount points, etc. More details about the VFS will be described in a separate part. After this we can see the signals_init function. This function is defined in kernel/signal.c and allocates a cache for the sigqueue structures, which represent the queue of real time signals. The next function is page_writeback_init. This function initializes the ratio for the dirty pages. Every low-level page entry contains the dirty bit which, when set, indicates that the page has been written to.
Creation of the root for the procfs
After all of these preparations we need to create the root for the proc filesystem. We will do it with the call of the proc_root_init function from fs/proc/root.c. At the start of the proc_root_init function we allocate the cache for the inodes and register a new filesystem in the system with:
err = register_filesystem(&proc_fs_type);
if (err)
    return;
As I wrote above, we will not dive into details about the VFS and different filesystems in this chapter, but will see them in the chapter about the VFS. After we've registered a new filesystem in our system, we call the proc_self_init function from fs/proc/self.c, and this function allocates an inode number for self (the /proc/self directory refers to the process accessing the /proc filesystem). The next step after proc_self_init is proc_setup_thread_self, which sets up the /proc/thread-self directory, which contains information about the current thread. After this we create the /proc/mounts symlink, which points to self/mounts and so shows the mount points, with the call of:
proc_symlink("mounts", NULL, "self/mounts");
and a couple of directories, depending on the different configuration options:
#ifdef CONFIG_SYSVIPC
    proc_mkdir("sysvipc", NULL);
#endif
    proc_mkdir("fs", NULL);
    proc_mkdir("driver", NULL);
    proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
    proc_mkdir("openprom", NULL);
#endif
    proc_mkdir("bus", NULL);
    ...
    ...
    ...
    if (!proc_mkdir("tty", NULL))
        return;
    proc_mkdir("tty/ldisc", NULL);
...
...
...
At the end of proc_root_init we call the proc_sys_init function, which creates the /proc/sys directory and initializes Sysctl.
It is the end of the start_kernel function. I did not describe all the functions which are called in start_kernel. I skipped them because they are not so important for the generic kernel initialization and depend only on different kernel configurations. They are taskstats_init_early, which exports per-task statistics to user space; delayacct_init, which initializes per-task delay accounting; key_init and security_init, which initialize different security stuff; check_bugs, which fixes up some architecture-dependent bugs; ftrace_init, which executes the initialization of ftrace; cgroup_init, which initializes the rest of the cgroup subsystem; etc. Many of these parts and subsystems will be described in other chapters.
That's all. Finally we have passed through the long-long start_kernel function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. At the end of start_kernel we can see the last call, to the rest_init function. Let's go ahead.
First steps after the start_kernel
The rest_init function is defined in the same source code file as the start_kernel function, and this file is init/main.c. At the beginning of rest_init we can see the calls of the two following functions:
rcu_scheduler_starting();
smpboot_thread_init();
The first, rcu_scheduler_starting, makes the RCU scheduler active, and the second, smpboot_thread_init, registers the smpboot_thread_notifier CPU notifier (you can read more about it in the CPU hotplug documentation). After this we can see the following calls:
kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
Here the kernel_thread function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the kernel_thread function takes three arguments:
Function which will be executed in a new thread;
Parameter for this function (NULL for kernel_init here);
Flags.
We will not dive into details about the kernel_thread implementation (we will see it in the chapter which describes the scheduler; for now we just need to say that kernel_thread invokes clone). Now we only need to know that we create a new kernel thread with the kernel_thread function, that the parent and child of the thread will share information about the filesystem, and that the new thread will start to execute the kernel_init function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two kernel_thread calls we create two new kernel threads, with PID = 1 for the init process and PID = 2 for kthreadd. We already know what the init process is. Let's look at kthreadd. It is a special kernel thread which allows init and different parts of the kernel to create other kernel threads. We can see it in the output of the ps util:
$ ps -ef | grep kthradd
alex 12866 4767 0 18:26 pts/0 00:00:00 grep kthradd
Let's postpone kernel_init and kthreadd for now and go ahead in rest_init. In the next step, after we have created two new kernel threads, we can see the following code:
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
The first function, rcu_read_lock, marks the beginning of an RCU read-side critical section, and rcu_read_unlock marks the end of an RCU read-side critical section. We call these functions because we need to protect find_task_by_pid_ns. The find_task_by_pid_ns function returns a pointer to the task_struct for the given pid. So here we are getting the pointer to the task_struct for PID = 2 (we got this pid after the kthreadd creation with kernel_thread). In the next step we call the complete function
complete(&kthreadd_done);
and pass the address of kthreadd_done. The kthreadd_done is defined as
static __initdata DECLARE_COMPLETION(kthreadd_done);
where the DECLARE_COMPLETION macro is defined as:
#define DECLARE_COMPLETION(work) \
    struct completion work = COMPLETION_INITIALIZER(work)
and expands to the definition of the completion structure. This structure is defined in include/linux/completion.h and presents the completions concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts. The first is the definition of the completion structure, and we did it with DECLARE_COMPLETION. The second is the call of wait_for_completion. After the call of this function, the thread which called it will not continue to execute and will wait until another thread calls the complete function. Note that we call wait_for_completion with kthreadd_done at the beginning of kernel_init_freeable:
wait_for_completion(&kthreadd_done);
And the last step is to call the complete function, as we saw above. After this, the kernel_init_freeable function will not proceed until the kthreadd thread has been set up. After kthreadd was set, we can see the three following functions in rest_init:
init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);
The first function, init_idle_bootup_task from kernel/sched/core.c, sets the scheduling class for the current process (the idle class in our case):
void init_idle_bootup_task(struct task_struct *idle)
{
    idle->sched_class = &idle_sched_class;
}
where the idle class is for low priority tasks, which can run only when the processor has nothing else to run. The second function, schedule_preempt_disabled, disables preemption in idle tasks. And the third function, cpu_startup_entry, is defined in kernel/sched/idle.c and calls cpu_idle_loop from the same file. The cpu_idle_loop function works as the process with PID = 0 and works in the background. The main purpose of cpu_idle_loop is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with the idle scheduling class (we just set the current task to idle with the call of the init_idle_bootup_task function), so the idle thread does not do useful work and just checks whether there is an active task to switch to:
static void cpu_idle_loop(void)
{
    ...
    ...
    ...
    while (1) {
        while (!need_resched()) {
            ...
            ...
            ...
        }
        ...
    }
}
More about it will be in the chapter about the scheduler. So at this moment start_kernel calls the rest_init function, which spawns an init process (the kernel_init function) and becomes the idle process itself. Now it is time to look at kernel_init. Execution of the kernel_init function starts from the call of the kernel_init_freeable function. The kernel_init_freeable function first of all waits for the completion of the kthreadd setup. I already wrote about it above:
wait_for_completion(&kthreadd_done);
After this we set gfp_allowed_mask to __GFP_BITS_MASK, which means that the system is already running; set allowed cpus/mems to all CPUs and NUMA nodes with the set_mems_allowed function; allow the init process to run on any CPU with set_cpus_allowed_ptr; set the pid for cad, or Ctrl-Alt-Delete; prepare for booting the other CPUs with the call of smp_prepare_cpus; call early initcalls with do_pre_smp_initcalls; initialize SMP with smp_init; initialize the lockup detector with the call of lockup_detector_init; and initialize the scheduler with sched_init_smp.
After this we can see the call of the following function: do_basic_setup. By the time we call the do_basic_setup function, our kernel is already initialized. As the comment says:
Now we can finally start doing some real work..
The do_basic_setup function will reinitialize the cpuset to the active CPUs, initialize khelper (a kernel thread which is used for making calls out to userspace from within the kernel), initialize tmpfs, initialize the drivers subsystem, enable the user-mode helper workqueue and make the post-early call of the initcalls. After do_basic_setup we can see the opening of /dev/console and two duplications of file descriptor 0, which give us the descriptors from 0 to 2:
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
    pr_err("Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
We are using two system calls here, sys_open and sys_dup. In the next chapters we will see the explanation and implementation of the different system calls. After we have opened the initial console, we check whether the rdinit= option was passed to the kernel command line, or else set the default path of the ramdisk:
if (!ramdisk_execute_command)
    ramdisk_execute_command = "/init";
Then we check whether this file is accessible and, if it is not, call the prepare_namespace function from init/do_mounts.c, which checks and mounts the initrd:
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
    ramdisk_execute_command = NULL;
    prepare_namespace();
}
This is the end of the kernel_init_freeable function and we need to return to kernel_init. The next step after kernel_init_freeable has finished its execution is async_synchronize_full. This function waits until all asynchronous function calls have been done, and after it we call free_initmem, which releases all memory occupied by the initialization stuff located between __init_begin and __init_end. After this we protect .rodata with mark_rodata_ro and update the state of the system from SYSTEM_BOOTING to SYSTEM_RUNNING:
system_state = SYSTEM_RUNNING;
and try to run the init process:
if (ramdisk_execute_command) {
    ret = run_init_process(ramdisk_execute_command);
    if (!ret)
        return 0;
    pr_err("Failed to execute %s (error %d)\n",
           ramdisk_execute_command, ret);
}
First of all it checks ramdisk_execute_command, which we set in the kernel_init_freeable function; it will be equal to the value of the rdinit= kernel command line parameter, or /init by default. The run_init_process function fills the first element of the argv_init array:
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
which represents the arguments of the init program, and calls the do_execve function:
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
    (const char __user *const __user *)argv_init,
    (const char __user *const __user *)envp_init);
The do_execve function is declared in include/linux/sched.h and runs a program with the given file name and arguments. If we did not pass the rdinit= option to the kernel command line, the kernel checks execute_command, which is equal to the value of the init= kernel command line parameter:
if (execute_command) {
    ret = run_init_process(execute_command);
    if (!ret)
        return 0;
    panic("Requested init %s failed (error %d).",
          execute_command, ret);
}
If we did not pass the init= kernel command line parameter either, the kernel tries to run one of the following executable files:
if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;
Otherwise we finish with panic:
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/init.txt for guidance.");
That's all! Linux kernel initialization process is finished!
Conclusion
It is the end of the tenth part about the linux kernel initialization process. And it is not only the tenth part, it is also the last part which describes the initialization of the linux kernel. As I wrote in the first part of this chapter, we will go through all the steps of the kernel initialization, and we did it. We started at the first architecture-independent function, start_kernel, and finished with the launch of the first init process in our system. I skipped details about different subsystems of the kernel: for example, I almost did not cover the linux kernel scheduler, and we saw almost nothing about interrupts and exceptions handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
SLAB
xsave
FPU
Documentation/security/credentials.txt
Documentation/x86/x86_64/mm
RCU
VFS
inode
proc
man proc
Sysctl
ftrace
cgroup
CPU hotplug documentation
completions - wait for completion handling
NUMA
cpus/mems
initcalls
Tmpfs
initrd
panic
Previous part
Interrupts and Interrupt Handling
You will find a couple of posts which describe interrupts and exceptions handling in the linux kernel.
Interrupts and Interrupt Handling. Part 1. - describes interrupts handling theory.
Start to dive into interrupts in the Linux kernel - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
Early interrupt handlers - third part describes early interrupt handlers.
Interrupt handlers - fourth part describes first non-early interrupt handlers.
Implementation of exception handlers - describes implementation of some exception handlers such as double fault, divide by zero, etc.
Handling Non-Maskable interrupts - describes handling of non-maskable interrupts and the rest of the interrupt handlers from the architecture-specific part.
Dive into external hardware interrupts - this part describes early initialization of code which is related to handling of external hardware interrupts.
Non-early initialization of the IRQs - this part describes non-early initialization of code which is related to handling of external hardware interrupts.
Softirq, Tasklets and Workqueues - this part describes softirqs, tasklets and workqueues concepts.
Last part - this is the last part of the interrupts and interrupt handling chapter, and here we will see a real hardware driver and interrupts related stuff.
Interrupts and Interrupt Handling. Part 1.
Introduction
This is the first part of the new chapter of the linux insides book. We have come a long way in the previous chapter of this book. We started from the earliest steps of kernel initialization and finished with the launch of the first init process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be interrupts.
What is an Interrupt?
We have already heard of the word interrupt in several parts of this book. We even saw a couple of examples of interrupt handlers. In the current chapter we will start from the theory, i.e.
What are interrupts?
What are interrupt handlers?
We will then continue to dig deeper into the details of interrupts and how the Linux kernel handles them.
So..., first of all, what is an interrupt? An interrupt is an event which is raised by software or hardware when it needs the CPU's attention. For example, we press a button on the keyboard, and what do we expect next? What should the operating system and computer do after this? To simplify matters, assume that each peripheral device has an interrupt line to the CPU. A device can use it to signal an interrupt to the CPU. However, interrupts are not signaled directly to the CPU. In the old machines there was a PIC, which is a chip responsible for sequentially processing multiple interrupt requests from multiple devices. In the new machines there is an Advanced Programmable Interrupt Controller, commonly known as APIC. An APIC consists of two separate devices:
Local APIC
I/O APIC
The first, the Local APIC, is located on each CPU core. The local APIC is responsible for handling the CPU-specific interrupt configuration. The local APIC is usually used to manage interrupts from the APIC-timer, thermal sensor and any other such locally connected I/O devices.
The second, the I/O APIC, provides multi-processor interrupt management. It is used to distribute external interrupts among the CPU cores. More about the local and I/O APICs will be covered later in this chapter. As you can understand, interrupts can occur at any time. When an interrupt occurs, the operating system must handle it immediately. But what does it mean to handle an interrupt? When an interrupt occurs, the operating system must ensure the following steps:
The kernel must pause execution of the current process (preempt the current task);
The kernel must search for the handler of the interrupt and transfer control to it (execute the interrupt handler);
After the interrupt handler completes execution, the interrupted process can resume execution.
Of course there are numerous intricacies involved in this procedure of handling interrupts. But the above 3 steps form the basic skeleton of the procedure.
Addresses of each of the interrupt handlers are maintained in a special location referred to as the Interrupt Descriptor Table or IDT. The processor uses a unique number for recognizing the type of interruption or exception. This number is called the vector number. A vector number is an index in the IDT. There is a limited number of vector numbers, and a vector number can be from 0 to 255. You can note the following range-check upon the vector number within the Linux kernel source code:
BUG_ON((unsigned)n > 0xFF);
You can find this check within the Linux kernel source code related to interrupt setup (e.g. set_intr_gate and set_system_intr_gate in arch/x86/include/asm/desc.h). The first 32 vector numbers, from 0 to 31, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - Early interrupt and exception handling. Vector numbers from 32 to 255 are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes:
External or hardware generated interrupts;
Software-generated interrupts.
The first, external interrupts, are received through the Local APIC or pins on the processor which are connected to the Local APIC. The second, software-generated interrupts, are caused by an exceptional condition in the processor itself (sometimes using special architecture-specific instructions). A common example of an exceptional condition is division by zero. Another example is exiting a program with the syscall instruction.
As mentioned earlier, an interrupt can occur at any time, for a reason which the code and CPU have no control over. On the other hand, exceptions are synchronous with program execution and can be classified into 3 categories:
Faults
Traps
Aborts
Afaultisanexceptionreportedbeforetheexecutionofa"faulty"instruction(whichcanthenbecorrected).Ifcorrected,itallowstheinterruptedprogramtoberesume.
Nextatrapisanexceptionwhichisreportedimmediatelyfollowingtheexecutionofthetrapinstruction.Trapsalsoallowtheinterruptedprogramtobecontinuedjustasafaultdoes.
Finallyanabortisanexceptionthatdoesnotalwaysreporttheexactinstructionwhichcausedtheexceptionanddoesnotallowtheinterruptedprogramtoberesumed.
Also we already know from the previous part that interrupts can be classified as maskable and non-maskable. Maskable interrupts are interrupts which can be blocked. On x86_64 they are blocked and unblocked with the following two instructions - cli and sti. We can find them in the Linux kernel source code:
static inline void native_irq_disable(void)
{
    asm volatile("cli": : :"memory");
}
and
static inline void native_irq_enable(void)
{
    asm volatile("sti": : :"memory");
}
These two instructions modify the IF flag bit within the flags register. The sti instruction sets the IF flag and the cli instruction clears this flag. Non-maskable interrupts are always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts.
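The effect of sti and cli on the IF flag can be modelled on an ordinary integer. The sketch below is an assumption-laden illustration, not kernel code: it relies on IF being bit 9 of the flags register, and the SKETCH_ macro and the model_ functions are hypothetical names. It only manipulates a copy of a flags value and never touches the real CPU state.

```c
#include <assert.h>
#include <stdint.h>

/* IF (interrupt enable flag) is bit 9 of the x86 flags register. */
#define SKETCH_FLAGS_IF (UINT64_C(1) << 9)

/* What sti does to the flags value: set IF. */
static uint64_t model_sti(uint64_t flags)
{
    return flags | SKETCH_FLAGS_IF;
}

/* What cli does to the flags value: clear IF, leave other bits alone. */
static uint64_t model_cli(uint64_t flags)
{
    uint64_t result = flags & ~SKETCH_FLAGS_IF;

    assert((result & SKETCH_FLAGS_IF) == 0);
    return result;
}

/* Maskable interrupts are delivered only while IF is set. */
static int maskable_irqs_enabled(uint64_t flags)
{
    return (flags & SKETCH_FLAGS_IF) != 0;
}
```

Non-maskable interrupts are outside this model on purpose: they are reported regardless of the value of IF.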
If multiple exceptions or interrupts occur at the same time, the processor handles them in order of their predefined priorities. We can determine the priorities from the highest to the lowest in the following table:
+----------------------------------------------------------------+
|              |                                                 |
|   Priority   | Description                                     |
|              |                                                 |
+--------------+-------------------------------------------------+
|              | Hardware Reset and Machine Checks               |
|      1       | - RESET                                         |
|              | - Machine Check                                 |
+--------------+-------------------------------------------------+
|              | Trap on Task Switch                             |
|      2       | - T flag in TSS is set                          |
|              |                                                 |
+--------------+-------------------------------------------------+
|              | External Hardware Interventions                 |
|              | - FLUSH                                         |
|      3       | - STOPCLK                                       |
|              | - SMI                                           |
|              | - INIT                                          |
+--------------+-------------------------------------------------+
|              | Traps on the Previous Instruction               |
|      4       | - Breakpoints                                   |
|              | - Debug Trap Exceptions                         |
+--------------+-------------------------------------------------+
|      5       | Nonmaskable Interrupts                          |
+--------------+-------------------------------------------------+
|      6       | Maskable Hardware Interrupts                    |
+--------------+-------------------------------------------------+
|      7       | Code Breakpoint Fault                           |
+--------------+-------------------------------------------------+
|      8       | Faults from Fetching Next Instruction           |
|              | Code-Segment Limit Violation                    |
|              | Code Page Fault                                 |
+--------------+-------------------------------------------------+
|              | Faults from Decoding the Next Instruction       |
|              | Instruction length > 15 bytes                   |
|      9       | Invalid Opcode                                  |
|              | Coprocessor Not Available                       |
|              |                                                 |
+--------------+-------------------------------------------------+
|      10      | Faults on Executing an Instruction              |
|              | Overflow                                        |
|              | Bound error                                     |
|              | Invalid TSS                                     |
|              | Segment Not Present                             |
|              | Stack fault                                     |
|              | General Protection                              |
|              | Data Page Fault                                 |
|              | Alignment Check                                 |
|              | x87 FPU Floating-point exception                |
|              | SIMD floating-point exception                   |
|              | Virtualization exception                        |
+--------------+-------------------------------------------------+
Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the Interrupt Descriptor Table. As mentioned earlier, the IDT stores entry points of the interrupt and exception handlers. The IDT is similar in structure to the Global Descriptor Table which we saw in the second part of the Kernel booting process. But of course it has some differences. Instead of descriptors, the IDT entries are called gates. It can contain one of the following gates:
Interrupt gates
Task gates
Trap gates
in the x86 architecture. Only long mode interrupt gates and trap gates can be referenced in the x86_64. Like the Global Descriptor Table, the Interrupt Descriptor Table is an array of 8-byte gates on x86 and an array of 16-byte gates on x86_64. We can remember from the second part of the Kernel booting process that the Global Descriptor Table must contain a NULL descriptor as its first element. Unlike the Global Descriptor Table, the first element of the Interrupt Descriptor Table may be a gate; a NULL first entry is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor Table with NULL gates only in the earlier part while transitioning into protected mode:
/*
 * Set up the IDT
 */
static void setup_idt(void)
{
    static const struct gdt_ptr null_idt = {0, 0};
    asm volatile("lidtl %0" : : "m" (null_idt));
}
from the arch/x86/boot/pm.c. The Interrupt Descriptor Table can be located anywhere in the linear address space and its base address should be aligned on an 8-byte boundary on x86 or a 16-byte boundary on x86_64. The base address of the IDT is stored in the special register - IDTR. There are two instructions on x86-compatible processors to work with the IDTR register:
LIDT
SIDT
The first instruction, LIDT, is used to load the base address of the IDT, i.e. the specified operand, into the IDTR. The second instruction, SIDT, is used to read and store the contents of the IDTR into the specified operand. The IDTR register is 48 bits on the x86 and contains the following information:
+-----------------------------------+----------------------+
|                                   |                      |
|     Base address of the IDT       |   Limit of the IDT   |
|                                   |                      |
+-----------------------------------+----------------------+
47                                16 15                    0
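The 48-bit layout in the diagram can be checked with a small packed structure. This is a sketch under stated assumptions: struct idtr48 and make_idtr are hypothetical names, the packed attribute is a GCC/Clang extension, and the rule that the limit holds the offset of the last valid byte (table size minus one) comes from the processor manuals rather than from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image of the 48-bit IDTR on 32-bit x86. */
struct idtr48 {
    uint16_t limit;  /* bits 0..15: offset of the last valid byte */
    uint32_t base;   /* bits 16..47: linear base address of the IDT */
} __attribute__((packed));

_Static_assert(sizeof(struct idtr48) == 6, "IDTR image must be 48 bits");
_Static_assert(offsetof(struct idtr48, base) == 2, "base follows the limit");

/* Build an IDTR image for a table of nr_gates gates of gate_size bytes. */
static struct idtr48 make_idtr(uint32_t base, unsigned int nr_gates,
                               unsigned int gate_size)
{
    struct idtr48 r;

    assert(nr_gates >= 1 && nr_gates <= 256);
    r.limit = (uint16_t)(nr_gates * gate_size - 1);
    r.base = base;
    return r;
}
```

For a full table of 256 eight-byte gates the limit comes out as 2047. A value such as {0, 0} describes a table with no usable gates.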
Looking at the implementation of setup_idt, we have prepared a null_idt and loaded it to the IDTR register with the lidt instruction. Note that null_idt has the gdt_ptr type, which is defined as:
struct gdt_ptr {
    u16 len;
    u32 ptr;
} __attribute__((packed));
Here we can see the definition of the structure with two fields of 2 bytes and 4 bytes (a total of 48 bits), as we can see in the diagram. Now let's look at the IDT entries structure. The IDT is an array of 16-byte entries which are called gates in the x86_64. They have the following structure:
127                                                                            96
+-------------------------------------------------------------------------------+
|                                                                               |
|                                   Reserved                                    |
|                                                                               |
+-------------------------------------------------------------------------------+
95                                                                             64
+-------------------------------------------------------------------------------+
|                                                                               |
|                                 Offset 63..32                                 |
|                                                                               |
+-------------------------------------------------------------------------------+
63                               48 47      46  44   42    39             34   32
+-------------------------------------------------------------------------------+
|                                  |       |  D  |   |     |      |   |   |     |
|       Offset 31..16              |   P   |  P  | 0 |Type |0 0 0 | 0 | 0 | IST |
|                                  |       |  L  |   |     |      |   |   |     |
+-------------------------------------------------------------------------------+
31                                   16 15                                      0
+-------------------------------------------------------------------------------+
|                                      |                                        |
|          Segment Selector            |                 Offset 15..0           |
|                                      |                                        |
+-------------------------------------------------------------------------------+
To form an index into the IDT, the processor scales the exception or interrupt vector by sixteen. The processor handles the occurrence of exceptions and interrupts just like it handles calls of a procedure when it sees the call instruction. A processor uses the unique number or vector number of the interrupt or the exception as the index to find the necessary Interrupt Descriptor Table entry. Now let's take a closer look at an IDT entry.
As we can see, the IDT entry on the diagram consists of the following fields:
0-15 bits - first part (bits 15..0) of the offset of the interrupt handler entry point within the code segment;
16-31 bits - segment selector of the code segment which contains the entry point of the interrupt handler;
IST - a new special mechanism in the x86_64, we will see it later;
DPL - Descriptor Privilege Level;
P - Segment Present flag;
48-63 bits - second part of the handler offset;
64-95 bits - third part of the handler offset;
96-127 bits - and the last bits are reserved by the CPU.
And the last Type field describes the type of the IDT entry. There are three different kinds of handlers for interrupts:
Interrupt gate
Trap gate
Task gate
The IST or Interrupt Stack Table is a new mechanism in the x86_64. It is used as an alternative to the legacy stack-switch mechanism. Previously the x86 architecture provided a mechanism to automatically switch stack frames in response to an interrupt. The IST is a modified version of the x86 stack switching mode. This mechanism unconditionally switches stacks when it is enabled, and it can be enabled for any interrupt in the IDT entry related to that interrupt (we will soon see it). From this we can understand that the IST is not necessary for all interrupts. Some interrupts can continue to use the legacy stack switching mode. The IST mechanism provides up to seven IST pointers in the Task State Segment or TSS, which is the special structure which contains information about a process. The TSS is used for stack switching during the execution of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an interrupt gate from the IDT.
The Interrupt Descriptor Table is represented by the array of the gate_desc structures:
extern gate_desc idt_table[];
where gate_desc is:
#ifdef CONFIG_X86_64
...
...
...
typedef struct gate_struct64 gate_desc;
...
...
...
#endif
and gate_struct64 is defined as:
struct gate_struct64 {
    u16 offset_low;
    u16 segment;
    unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
    u16 offset_middle;
    u32 offset_high;
    u32 zero1;
} __attribute__((packed));
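How a 64-bit handler address is spread over the three offset fields of a gate, and how the processor finds a gate from a vector number, can be sketched as follows. The structure and function names here are hypothetical and only mirror offset_low, offset_middle and offset_high from gate_struct64. The factor of sixteen is the x86_64 gate size mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the three offset fields of gate_struct64. */
struct gate_offset {
    uint16_t low;     /* offset bits 15..0  */
    uint16_t middle;  /* offset bits 31..16 */
    uint32_t high;    /* offset bits 63..32 */
};

/* Split a handler address into the three parts stored in a gate. */
static struct gate_offset split_offset(uint64_t handler)
{
    struct gate_offset g;

    g.low = (uint16_t)(handler & 0xFFFF);
    g.middle = (uint16_t)((handler >> 16) & 0xFFFF);
    g.high = (uint32_t)(handler >> 32);
    return g;
}

/* Reassemble the entry point the way the processor reads it back. */
static uint64_t join_offset(struct gate_offset g)
{
    return ((uint64_t)g.high << 32) | ((uint64_t)g.middle << 16) | g.low;
}

/* Address of the gate for a vector: the vector is scaled by 16 bytes. */
static uint64_t gate_address(uint64_t idt_base, unsigned int vector)
{
    assert(vector <= 0xFF);
    return idt_base + (uint64_t)vector * 16;
}
```

For example, with a table at 0x1000 the gate for vector 14 would live at 0x10E0, and splitting then joining any handler address returns the original value.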
Each active thread has a large stack in the Linux kernel for the x86_64 architecture. The stack size is defined as THREAD_SIZE and is equal to:
#define PAGE_SHIFT 12
#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
...
...
...
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
The PAGE_SIZE is 4096 bytes and the THREAD_SIZE_ORDER depends on the KASAN_STACK_ORDER. As we can see, the KASAN_STACK_ORDER depends on the CONFIG_KASAN kernel configuration parameter and is defined as:
#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif
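The arithmetic behind these macros can be reproduced in user space. The sketch below is not the kernel's code: the SKETCH_ macros and the thread_size function are hypothetical, and the CONFIG_KASAN choice is passed as an ordinary argument instead of a preprocessor symbol.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)   /* 4096 bytes */

/* Compute THREAD_SIZE for a kernel built with or without KASAN. */
static unsigned long thread_size(int kasan_enabled)
{
    unsigned int kasan_stack_order;
    unsigned int thread_size_order;

    assert(kasan_enabled == 0 || kasan_enabled == 1);
    kasan_stack_order = kasan_enabled ? 1 : 0;
    thread_size_order = 2 + kasan_stack_order;
    return SKETCH_PAGE_SIZE << thread_size_order;
}
```

Each step of the order doubles the size, so an order of 2 means four pages and an order of 3 means eight pages.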
KASan is a runtime memory debugger. So the THREAD_SIZE will be 16384 bytes if CONFIG_KASAN is disabled or 32768 bytes if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the thread_info structure (details about this structure are available in the fourth part of the Linux kernel initialization process) at the bottom of the stack.
The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the interrupt stack used for the external hardware interrupts. Its size is determined as follows:
#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
or 16384 bytes when CONFIG_KASAN is disabled. The per-cpu interrupt stack is represented by the irq_stack_union union in the Linux kernel for x86_64:
union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];
    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};
The first irq_stack field is a 16 kilobyte array. Also you can see that irq_stack_union contains a structure with two fields:
gs_base - The gs register always points to the bottom of the irq_stack_union. On the x86_64, the gs register is shared by the per-cpu area and the stack canary (you can read more about per-cpu variables in the special part). All per-cpu symbols are zero-based and the gs points to the base of the per-cpu area. You already know that the segmented memory model is abolished in long mode, but we can set the base address for the two segment registers - fs and gs - with the Model specific registers, and these registers can still be used as address registers. If you remember the first part of the Linux kernel initialization process, you can remember that we have set the gs register:
    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr
where initial_gs points to the irq_stack_union:
GLOBAL(initial_gs)
.quad   INIT_PER_CPU_VAR(irq_stack_union)
stack_canary - The stack canary for the interrupt stack is a stack protector to verify that the stack hasn't been overwritten. Note that gs_base is a 40-byte array. GCC requires that the stack canary be at a fixed offset from the base of the gs, and this offset must be 40 for the x86_64 and 20 for the x86.
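The layout requirement described above, the canary at offset 40 from the start of the union, can be verified at compile time. This is a sketch, not the kernel definition: sketch_irq_stack_union, canary_slot and the SKETCH_ constant are hypothetical names, the size assumes CONFIG_KASAN is disabled, and the anonymous struct requires a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IRQ_STACK_SIZE (4096UL << 2)   /* 16384 bytes */

/* Hypothetical copy of the union layout shown above. */
union sketch_irq_stack_union {
    char irq_stack[SKETCH_IRQ_STACK_SIZE];
    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};

_Static_assert(offsetof(union sketch_irq_stack_union, stack_canary) == 40,
               "stack_canary must sit 40 bytes above the gs base");
_Static_assert(sizeof(union sketch_irq_stack_union) == SKETCH_IRQ_STACK_SIZE,
               "the union is exactly as large as the interrupt stack");

/* One statically allocated instance, standing in for the per-cpu object. */
static union sketch_irq_stack_union sketch_stack;

/* Address of the canary, expressed as a byte inside the stack array. */
static char *canary_slot(union sketch_irq_stack_union *u)
{
    assert(u != NULL);
    return (char *)&u->stack_canary;
}
```

Because both members of the union start at the same address, the canary occupies the bytes of irq_stack that begin at index 40, at the low end of the stack.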
The irq_stack_union is the first datum in the percpu area, we can see it in the System.map:
0000000000000000 D __per_cpu_start
0000000000000000 D irq_stack_union
0000000000004000 d exception_stacks
0000000000009000 D gdt_page
...
...
...
We can see its definition in the code:
DECLARE_PER_CPU_FIRST(union irq_stack_union, irq_stack_union) __visible;
Now, it's time to look at the initialization of the irq_stack_union. Besides the irq_stack_union definition, we can see the definition of the following per-cpu variables in the arch/x86/include/asm/processor.h:
DECLARE_PER_CPU(char *, irq_stack_ptr);
DECLARE_PER_CPU(unsigned int, irq_count);
The first is the irq_stack_ptr. From the variable's name, it is obvious that this is a pointer to the top of the stack. The second - irq_count - is used to check if a CPU is already on an interrupt stack or not. Initialization of the irq_stack_ptr is located in the setup_per_cpu_areas function in arch/x86/kernel/setup_percpu.c:
void __init setup_per_cpu_areas(void)
{
...
...
#ifdef CONFIG_X86_64
for_each_possible_cpu(cpu) {
    ...
    ...
    ...
    per_cpu(irq_stack_ptr, cpu) =
            per_cpu(irq_stack_union.irq_stack, cpu) +
            IRQ_STACK_SIZE - 64;
    ...
    ...
    ...
#endif
    ...
    ...
}
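The pointer arithmetic in the loop above can be sketched with an ordinary array. The names below are hypothetical stand-ins for the per-cpu objects, and the size assumes CONFIG_KASAN is disabled. The sketch only shows where the pointer lands. It does not explain why the kernel leaves 64 bytes free.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IRQ_STACK_SIZE 16384

/* Stand-in for one CPU's irq_stack array. */
static char sketch_irq_stack[SKETCH_IRQ_STACK_SIZE];

/* Initial interrupt stack pointer: 64 bytes below the end of the array.
 * The stack grows down, so the "top" is the highest address. */
static char *initial_irq_stack_ptr(char *irq_stack)
{
    assert(irq_stack != NULL);
    return irq_stack + SKETCH_IRQ_STACK_SIZE - 64;
}

/* Number of bytes left unused between the pointer and the end. */
static ptrdiff_t bytes_above_ptr(char *irq_stack)
{
    char *end = irq_stack + SKETCH_IRQ_STACK_SIZE;

    return end - initial_irq_stack_ptr(irq_stack);
}
```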
Here we go over all the CPUs one-by-one and set up irq_stack_ptr. This turns out to be equal to the top of the interrupt stack minus 64. Why 64? TODO. The load_percpu_segment function from the arch/x86/kernel/cpu/common.c source code file is the following:
void load_percpu_segment(int cpu)
{
    ...
    ...
    ...
    loadsegment(gs, 0);
    wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
}
and as we already know the gs register points to the bottom of the interrupt stack:
    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr
    GLOBAL(initial_gs)
    .quad   INIT_PER_CPU_VAR(irq_stack_union)
Here we can see the wrmsr instruction which loads the data from edx:eax into the Model specific register pointed to by the ecx register. In our case the model specific register is MSR_GS_BASE, which contains the base address of the memory segment pointed to by the gs register. edx:eax holds the value stored at initial_gs, which is the base address of our irq_stack_union.
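The way a 64-bit value is handed to wrmsr in the edx:eax pair can be shown with plain integer arithmetic. The structure and function names below are hypothetical. The sketch only models how the two movl instructions split the quad word and how the processor recombines the halves, and it does not execute wrmsr.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves loaded by the movl instructions. */
struct msr_halves {
    uint32_t eax;   /* low 32 bits, read from initial_gs       */
    uint32_t edx;   /* high 32 bits, read from initial_gs + 4  */
};

/* The value wrmsr writes: edx supplies bits 63..32, eax bits 31..0. */
static uint64_t value_written(struct msr_halves h)
{
    return ((uint64_t)h.edx << 32) | h.eax;
}

/* Split a 64-bit value the way the two loads see it in memory. */
static struct msr_halves split_for_wrmsr(uint64_t value)
{
    struct msr_halves h;

    h.eax = (uint32_t)(value & 0xFFFFFFFFu);
    h.edx = (uint32_t)(value >> 32);
    assert(value_written(h) == value);
    return h;
}
```

The offset of 4 in initial_gs+4 works because x86 is little-endian: the low half of the quad word is stored first.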
We already know that x86_64 has a feature called Interrupt Stack Table or IST, and this feature provides the ability to switch to a new stack for events such as a non-maskable interrupt, a double fault, etc. There can be up to seven IST entries per-cpu. Some of them are:
DOUBLEFAULT_STACK
NMI_STACK
DEBUG_STACK
MCE_STACK
or
#define DOUBLEFAULT_STACK 1
#define NMI_STACK 2
#define DEBUG_STACK 3
#define MCE_STACK 4
All interrupt-gate descriptors which switch to a new stack with the IST are initialized with the set_intr_gate_ist function. For example:
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
...
...
...
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
where &nmi and &double_fault are addresses of the entries to the given interrupt handlers:
asmlinkage void nmi(void);
asmlinkage void double_fault(void);
defined in the arch/x86/kernel/entry_64.S:
idtentry double_fault do_double_fault has_error_code=1 paranoid=2
...
...
...
ENTRY(nmi)
...
...
...
END(nmi)
When an interrupt or an exception occurs, the new ss selector is forced to NULL and the ss selector's rpl field is set to the new cpl. The old ss, rsp, flags register, cs and rip are pushed onto the new stack. In 64-bit mode, the size of interrupt stack-frame pushes is fixed at 8 bytes, so we will get the following stack:
+---------------+
|               |
|      SS       | 40
|      RSP      | 32
|     RFLAGS    | 24
|      CS       | 16
|      RIP      | 8
|   Error code  | 0
|               |
+---------------+
If the IST field in the interrupt gate is not 0, we read the IST pointer into rsp. If the interrupt vector number has an error code associated with it, we then push the error code onto the stack. If the interrupt vector number has no error code, we go ahead and push the dummy error code onto the stack. We need to do this to ensure stack consistency. Next we load the segment-selector field from the gate descriptor into the CS register and must verify that the target code segment is a 64-bit mode code segment by checking bit 21, i.e. the L bit in the Global Descriptor Table. Finally we load the offset field from the gate descriptor into rip, which will be the entry point of the interrupt handler. After this the interrupt handler begins to execute. After an interrupt handler finishes its execution, it must return control to the interrupted process with the iret instruction. The iret instruction unconditionally pops the stack pointer (ss:rsp) to restore the stack of the interrupted process and does not depend on the cpl change.
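The stack picture above maps directly onto a structure of six 8-byte slots. The sketch below is an illustration with hypothetical names, not the kernel's pt_regs. The helper that tests the low two bits of the saved cs rests on an assumption from outside this text: those bits hold the privilege level of the interrupted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the frame, lowest address first. */
struct irq_frame_sketch {
    uint64_t error_code;  /* offset 0, pushed last (real or dummy) */
    uint64_t rip;         /* offset 8  */
    uint64_t cs;          /* offset 16 */
    uint64_t rflags;      /* offset 24 */
    uint64_t rsp;         /* offset 32 */
    uint64_t ss;          /* offset 40, pushed first */
};

_Static_assert(offsetof(struct irq_frame_sketch, rip) == 8, "RIP at 8");
_Static_assert(offsetof(struct irq_frame_sketch, cs) == 16, "CS at 16");
_Static_assert(offsetof(struct irq_frame_sketch, rflags) == 24, "RFLAGS at 24");
_Static_assert(offsetof(struct irq_frame_sketch, rsp) == 32, "RSP at 32");
_Static_assert(offsetof(struct irq_frame_sketch, ss) == 40, "SS at 40");

/* Assumption: the low two bits of the saved CS are the privilege
 * level of the interrupted code, and 3 means user mode. */
static int frame_from_user_mode(const struct irq_frame_sketch *f)
{
    assert(f != NULL);
    return (f->cs & 3) == 3;
}
```

Because every push is 8 bytes wide, the whole frame is 48 bytes, which is why a dummy error code is pushed for vectors that have none: handlers can then rely on one fixed layout.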
That's all.
Conclusion
It is the end of the first part about interrupts and interrupt handling in the Linux kernel. We saw some theory and the first steps of the initialization of stuff related to interrupts and exceptions. In the next part we will continue to dive into interrupts and interrupt handling - into the more practical aspects of it.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
PIC
Advanced Programmable Interrupt Controller
protected mode
long mode
kernel stacks
Task State Segment
segmented memory model
Model specific registers
Stack canary
Previous chapter
Interrupts and Interrupt Handling. Part 2.
Start to dive into interrupt and exceptions handling in the Linux kernel
We saw some theory about interrupts and exception handling in the previous part and, as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. The previous part mostly described theoretical aspects; in this part we go directly into the Linux kernel source code. We will do it as we did in other chapters, starting from the very early places. We will not see the Linux kernel source code from the earliest code lines as we saw it, for example, in the Linux kernel booting process chapter, but we will start from the earliest code which is related to interrupts and exceptions. In this part we will try to go through all the interrupt and exception related stuff which we can find in the Linux kernel source code.
If you've read the previous parts, you can remember that the earliest place in the Linux kernel x86_64 architecture-specific source code which is related to interrupts is located in the arch/x86/boot/pm.c source code file and represents the first setup of the Interrupt Descriptor Table. It occurs right before the transition into protected mode in the go_to_protected_mode function by the call of setup_idt:
void go_to_protected_mode(void)
{
    ...
    setup_idt();
    ...
}
The setup_idt function is defined in the same source code file as the go_to_protected_mode function and just loads the address of the NULL interrupt descriptor table:
static void setup_idt(void)
{
    static const struct gdt_ptr null_idt = {0, 0};
    asm volatile("lidtl %0" : : "m" (null_idt));
}
where gdt_ptr represents the special 48-bit GDTR register which must contain the base address of the Global Descriptor Table:
struct gdt_ptr {
    u16 len;
    u32 ptr;
} __attribute__((packed));
Of course in our case the gdt_ptr does not represent the GDTR register, but the IDTR, since we set the Interrupt Descriptor Table. You will not find an idt_ptr structure, because if it had been in the Linux kernel source code, it would have been the same as gdt_ptr but with a different name. So, as you can understand, there is no sense in having two similar structures which differ only by name. You can note here that we do not fill the Interrupt Descriptor Table with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the IDT with NULL.
After the setup of the Interrupt Descriptor Table, Global Descriptor Table and other stuff, we jump into protected mode in the arch/x86/boot/pmjump.S. You can read more about it in the part which describes the transition to protected mode.
We already know from the earliest parts that the entry to protected mode is located in boot_params.hdr.code32_start, and you can see that we pass the entry of the protected mode and boot_params to protected_mode_jump at the end of the arch/x86/boot/pm.c:
protected_mode_jump(boot_params.hdr.code32_start,
                    (u32)&boot_params + (ds() << 4));
The protected_mode_jump is defined in the arch/x86/boot/pmjump.S and gets these two parameters in the ax and dx registers using one of the 8086 calling conventions:
GLOBAL(protected_mode_jump)
    ...
    ...
    ...
    .byte   0x66, 0xea        # ljmpl opcode
2:  .long   in_pm32           # offset
    .word   __BOOT_CS         # segment
...
...
...
ENDPROC(protected_mode_jump)
where in_pm32 contains a jump to the 32-bit entry point:
GLOBAL(in_pm32)
    ...
    ...
    jmpl    *%eax // %eax contains address of the `startup_32`
    ...
    ...
ENDPROC(in_pm32)
As you can remember, the 32-bit entry point is in the arch/x86/boot/compressed/head_64.S assembly file, although it contains _64 in its name. We can see two similar files in the arch/x86/boot/compressed directory:
arch/x86/boot/compressed/head_32.S;
arch/x86/boot/compressed/head_64.S.
But the 32-bit mode entry point is in the second file in our case. The first file is not even compiled for x86_64. Let's look at the arch/x86/boot/compressed/Makefile:
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
...
...
We can see here that head_* depends on the $(BITS) variable, which depends on the architecture. You can find it in the arch/x86/Makefile:
ifeq ($(CONFIG_X86_32),y)
...
    BITS := 32
else
    BITS := 64
    ...
endif
Now that we have jumped to startup_32 from the arch/x86/boot/compressed/head_64.S, we will not find anything related to interrupt handling here. The startup_32 contains code that makes preparations before the transition into long mode and directly jumps into it. The long mode entry is located in startup_64, and it makes preparations before the kernel decompression that occurs in decompress_kernel from the arch/x86/boot/compressed/misc.c. After the kernel is decompressed, we jump to startup_64 from the arch/x86/kernel/head_64.S. In startup_64 we start to build identity-mapped pages. After we have built identity-mapped pages, checked the NX bit, set up the Extended Feature Enable Register (see in links), and updated the early Global Descriptor Table with the lgdt instruction, we need to set up the gs register with the following code:
    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr
We already saw this code in the previous part. First of all pay attention to the last wrmsr instruction. This instruction writes data from the edx:eax registers to the model specific register specified by the ecx register. We can see that ecx contains $MSR_GS_BASE, which is declared in the arch/x86/include/uapi/asm/msr-index.h and looks like:
#define MSR_GS_BASE 0xc0000101
From this we can understand that MSR_GS_BASE defines the number of the model specific register. Since the registers cs, ds, es, and ss are not used in 64-bit mode, their fields are ignored. But we can access memory over the fs and gs registers. The model specific register provides a back door to the hidden parts of these segment registers and allows the use of a 64-bit base address for the segment register addressed by fs and gs. So the MSR_GS_BASE is the hidden part, and this part is mapped on the GS.base field. Let's look at initial_gs:
GLOBAL(initial_gs)
    .quad   INIT_PER_CPU_VAR(irq_stack_union)
We pass the irq_stack_union symbol to the INIT_PER_CPU_VAR macro, which just concatenates the init_per_cpu__ prefix with the given symbol. In our case we will get the init_per_cpu__irq_stack_union symbol. Let's look at the linker script. There we can see the following definition:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);
It tells us that the address of init_per_cpu__irq_stack_union will be irq_stack_union + __per_cpu_load. Now we need to understand where init_per_cpu__irq_stack_union and __per_cpu_load are and what they mean. The first, irq_stack_union, is defined in the arch/x86/include/asm/processor.h with the DECLARE_INIT_PER_CPU macro, which expands to a call of the init_per_cpu_var macro:
DECLARE_INIT_PER_CPU(irq_stack_union);
#define DECLARE_INIT_PER_CPU(var) \
       extern typeof(per_cpu_var(var)) init_per_cpu_var(var)
#define init_per_cpu_var(var)  init_per_cpu__##var
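The token pasting done by the init_per_cpu_var macro can be tried out in user space. The sketch below uses hypothetical sketch_ and SKETCH_ names and a plain int in place of the real per-cpu union. It only demonstrates that the ## operator produces an ordinary identifier with the init_per_cpu__ prefix.

```c
#include <assert.h>

/* Two-level stringification, so the argument is expanded first. */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x)  SKETCH_STR2(x)

/* Same shape as the kernel macro: paste a prefix onto the name. */
#define sketch_init_per_cpu_var(var) init_per_cpu__##var

/* The pasted name is a normal identifier and can be defined. */
static int sketch_init_per_cpu_var(irq_stack_union) = 7;

/* Spell the resulting identifier as a string for inspection. */
static const char *pasted_name(void)
{
    return SKETCH_STR(sketch_init_per_cpu_var(irq_stack_union));
}

/* Both spellings refer to the same object. */
static int pasted_value(void)
{
    assert(&init_per_cpu__irq_stack_union ==
           &sketch_init_per_cpu_var(irq_stack_union));
    return init_per_cpu__irq_stack_union;
}
```

Here pasted_name() yields the string "init_per_cpu__irq_stack_union", the same symbol name that appears in System.map later in this part.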
If we expand all macros we will get the same init_per_cpu__irq_stack_union as we got after expanding the INIT_PER_CPU macro, but you can note that it is not just a symbol, but a variable. Let's look at the typeof(per_cpu_var(var)) expression. Our var is irq_stack_union and the per_cpu_var macro is defined in the arch/x86/include/asm/percpu.h:
#define PER_CPU_VAR(var) %__percpu_seg:var
where:
#ifdef CONFIG_X86_64
    #define __percpu_seg gs
#endif
So, we are accessing gs:irq_stack_union and getting its type, which is irq_stack_union. Ok, we defined the first variable and know its address, now let's look at the second symbol, __per_cpu_load. There are a couple of per-cpu variables which are located after this symbol. The __per_cpu_load is defined in the include/asm-generic/sections.h:
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
and represents the base address of the per-cpu variables from the data area. So, we know the address of the irq_stack_union and of __per_cpu_load, and we know that init_per_cpu__irq_stack_union must be placed right after __per_cpu_load. And we can see it in the System.map:
...
...
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
...
...
Now we know about initial_gs, so let's look at the code:
    movl    $MSR_GS_BASE,%ecx
    movl    initial_gs(%rip),%eax
    movl    initial_gs+4(%rip),%edx
    wrmsr
Here we specified a model specific register with MSR_GS_BASE, put the 64-bit address stored at initial_gs into the edx:eax pair and executed the wrmsr instruction to fill the gs register with the base address of init_per_cpu__irq_stack_union, which will be at the bottom of the interrupt stack. After this we will jump to the C code in x86_64_start_kernel from the arch/x86/kernel/head64.c. In the x86_64_start_kernel function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early Interrupt Descriptor Table with the interrupt handler entries or early_idt_handlers. You can remember it if you have read the part about the Early interrupt and exception handling, and you can remember the following code:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
    set_intr_gate(i, early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
but I wrote the Early interrupt and exception handling part when the Linux kernel version was 3.18. Today the actual version of the Linux kernel is 4.1.0-rc6+, and Andy Lutomirski sent a patch, soon to be in the mainline kernel, that changes the behaviour of early_idt_handlers. NOTE: while I was writing this part, the patch was already merged into the Linux kernel source code. Let's look at it. Now the same part looks like:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
    set_intr_gate(i, early_idt_handler_array[i]);
load_idt((const struct desc_ptr *)&idt_descr);
As you can see, it has only one difference: the name of the array of the interrupt handler entry points. Now it is early_idt_handler_array:
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
where NUM_EXCEPTION_VECTORS and EARLY_IDT_HANDLER_SIZE are defined as:
#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9
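Given these two constants, the address of each early handler stub follows from ordinary two-dimensional array indexing. The sketch below uses a hypothetical zero-filled array in place of the real stubs and only computes byte offsets.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_EXCEPTION_VECTORS  32
#define SKETCH_EARLY_IDT_HANDLER_SIZE 9

/* Stand-in for early_idt_handler_array: 32 stubs of 9 bytes each. */
static const char sketch_handler_array[SKETCH_NUM_EXCEPTION_VECTORS]
                                      [SKETCH_EARLY_IDT_HANDLER_SIZE];

/* Byte offset of the stub for a vector from the start of the array. */
static size_t handler_offset(unsigned int vector)
{
    const char *base = (const char *)sketch_handler_array;

    assert(vector < SKETCH_NUM_EXCEPTION_VECTORS);
    return (size_t)(sketch_handler_array[vector] - base);
}
```

So early_idt_handler_array[i] in the loop above is simply the address of the array plus i times nine bytes, and the whole array occupies 288 bytes.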
So, the early_idt_handler_array is an array of the interrupt handler entry points and contains one entry point every nine bytes. You can remember that the previous early_idt_handlers was defined in the arch/x86/kernel/head_64.S. The early_idt_handler_array is defined in the same source code file too:
ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_common)
It fills early_idt_handler_array with the .rept NUM_EXCEPTION_VECTORS and contains the entry of the early_make_pgtable interrupt handler (you can read more about its implementation in the part about Early interrupt and exception handling). For now we have come to the end of the x86_64 architecture-specific code and the next part is the generic kernel code. Of course you already know that we will return to the architecture-specific code in the setup_arch function and other places, but this is the end of the x86_64 early code.
Setting stack canary for the interrupt stack
The next stop after the arch/x86/kernel/head_64.S is the big start_kernel function from the init/main.c. If you've read the previous chapter about the Linux kernel initialization process, you must remember it. This function does all the initialization before the kernel launches the first init process with pid 1. The first thing that is related to interrupt and exception handling is the call of the boot_init_stack_canary function.
This function sets the canary value to protect the interrupt stack from overflow. We already saw some details about the implementation of boot_init_stack_canary in the previous part, and now let's take a closer look at it. You can find the implementation of this function in the arch/x86/include/asm/stackprotector.h, and it depends on the CONFIG_CC_STACKPROTECTOR kernel configuration option. If this option is not set, this function will not do anything:
#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif
If the CONFIG_CC_STACKPROTECTOR kernel configuration option is set, the boot_init_stack_canary function starts from the check that the stack_canary field lies at an offset of forty bytes from the start of irq_stack_union, which represents the per-cpu interrupt stack:
#ifdef CONFIG_X86_64
    BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
As we can read in the previous part, the irq_stack_union is represented by the following union:
union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];
    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};
which is defined in the arch/x86/include/asm/processor.h. We know that a union in the C programming language is a data structure whose members share the same memory, so it stores only one of its fields at a time. We can see here that the structure has the first field - gs_base - which is 40 bytes in size and represents the bottom of the irq_stack. So, after this, our check with the BUILD_BUG_ON macro should end successfully (you can read the first part about the Linux kernel initialization process if you're interested in the BUILD_BUG_ON macro).
After this we calculate a new canary value based on a random number and the Time Stamp Counter:
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
and write the canary value to the irq_stack_union with the this_cpu_write macro:
this_cpu_write(irq_stack_union.stack_canary, canary);
You can read more about this_cpu_* operations in the Linux kernel documentation.
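The mixing step can be tried out with fixed inputs in place of the random bytes and the Time Stamp Counter. The function names below are hypothetical and the arguments are supplied by the caller. The sketch only reproduces the arithmetic of the canary += tsc + (tsc << 32UL) line on 64-bit unsigned values.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the random bytes with the TSC the same way as the line above:
 * the TSC is added both as-is and shifted into the upper half. */
static uint64_t mix_canary(uint64_t random_bytes, uint64_t tsc)
{
    return random_bytes + tsc + (tsc << 32);
}

/* Unsigned arithmetic wraps around on overflow, which is well defined
 * in C, so any pair of inputs produces some 64-bit canary. */
static void mix_canary_selfcheck(void)
{
    assert(mix_canary(UINT64_MAX, 1) == UINT64_C(0x100000000));
}
```

With a TSC value of 1 and zero random bytes the result is 0x100000001, which shows that the low bits of the counter, the ones that change fastest, end up in both halves of the canary.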
Disabling/Enabling local interrupts
The next step in the init/main.c which is related to interrupts and interrupt handling, after we have set the canary value for the interrupt stack, is the call of the local_irq_disable macro.
This macro is defined in the include/linux/irqflags.h header file and, as you can understand, we can disable interrupts for the CPU with the call of this macro. Let's look at its implementation. First of all note that it depends on the CONFIG_TRACE_IRQFLAGS_SUPPORT kernel configuration option:
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
         do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable()     do { raw_local_irq_disable(); } while (0)
...
#endif
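Both variants of the macro wrap their body in do { ... } while (0). A small user-space sketch shows why this idiom is used: the macro then behaves like a single statement, so it can sit in an if/else without braces. All sketch_ names below are hypothetical stand-ins, and the two counters merely imitate the hardware state and the tracing hook.

```c
#include <assert.h>

static int sketch_irqs_off;      /* stand-in for the cleared IF flag */
static int sketch_trace_events;  /* stand-in for the tracing hook    */

#define sketch_raw_local_irq_disable() (sketch_irqs_off = 1)
#define sketch_trace_hardirqs_off()    (sketch_trace_events++)

/* Same shape as the traced variant of local_irq_disable(). */
#define sketch_local_irq_disable() \
    do { sketch_raw_local_irq_disable(); sketch_trace_hardirqs_off(); } while (0)

/* The macro call followed by ';' is exactly one statement, so the
 * 'else' below still pairs with the 'if'. With a bare { ... } body
 * the extra ';' would end the if statement and 'else' would be an error. */
static int disable_if(int cond)
{
    assert(cond == 0 || cond == 1);
    if (cond)
        sketch_local_irq_disable();
    else
        return 0;
    return 1;
}
```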
They are both similar and, as you can see, have only one difference: the local_irq_disable macro contains a call of trace_hardirqs_off when CONFIG_TRACE_IRQFLAGS_SUPPORT is enabled. There is a special feature in the lockdep subsystem - irq-flags tracing - for tracing hardirq and softirq state. In our case the lockdep subsystem can give us interesting information about the hard/soft irqs on/off events which occur in the system. The trace_hardirqs_off function is defined in the kernel/locking/lockdep.c:
void trace_hardirqs_off(void)
{
    trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);
and just calls the trace_hardirqs_off_caller function. The trace_hardirqs_off_caller checks the hardirqs_enabled field of the current process and increments redundant_hardirqs_off if the call of local_irq_disable was redundant, or hardirqs_off_events if it was not. These two fields and other lockdep statistic related fields are defined in the kernel/locking/lockdep_internals.h and located in the lockdep_stats structure:
struct lockdep_stats {
...
...
...
int softirqs_off_events;
int redundant_softirqs_off;
...
...
...
};
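The bookkeeping described above, counting a real off event or a redundant one depending on the current state, can be sketched in a few lines. This is a simplified model with hypothetical sketch_ names, not the lockdep implementation: it keeps only the two counters and the per-task flag mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two hardirq counters discussed in the text. */
struct sketch_stats {
    int hardirqs_off_events;
    int redundant_hardirqs_off;
};

/* Only the per-task flag the tracing code consults. */
struct sketch_task {
    int hardirqs_enabled;
};

/* Count a real transition if interrupts were on, else a redundant call. */
static void sketch_trace_hardirqs_off(struct sketch_task *curr,
                                      struct sketch_stats *stats)
{
    assert(curr != NULL && stats != NULL);
    if (curr->hardirqs_enabled) {
        curr->hardirqs_enabled = 0;
        stats->hardirqs_off_events++;
    } else {
        stats->redundant_hardirqs_off++;
    }
}

/* Disable twice in a row, starting with interrupts enabled. */
static struct sketch_stats run_two_disables(void)
{
    struct sketch_task task = { 1 };
    struct sketch_stats stats = { 0, 0 };

    sketch_trace_hardirqs_off(&task, &stats);
    sketch_trace_hardirqs_off(&task, &stats);
    return stats;
}
```

Two back-to-back disables produce one off event and one redundant call, which is the kind of ratio the statistics output makes visible.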
If you set the CONFIG_DEBUG_LOCKDEP kernel configuration option, the lockdep_stats_debug_show function will write all tracing information to the /proc/lockdep:
static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
    unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
                       hi2 = debug_atomic_read(hardirqs_off_events),
                       hr1 = debug_atomic_read(redundant_hardirqs_on),
    ...
    ...
    ...
    seq_printf(m, " hardirq on events: %11llu\n", hi1);
    seq_printf(m, " hardirq off events: %11llu\n", hi2);
    seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
#endif
}
and you can see its result with:
$ sudo cat /proc/lockdep
 hardirq on events:             12838248974
 hardirq off events:            12838248979
 redundant hardirq ons:               67792
 redundant hardirq offs:         3836339146
 softirq on events:                38002159
 softirq off events:               38002187
 redundant softirq ons:                   0
 redundant softirq offs:                  0
Ok, now we know a little about tracing, but more info will be in the separate part about lockdep and tracing. You can see that both local_irq_disable macros have the same part - raw_local_irq_disable. This macro is defined in the arch/x86/include/asm/irqflags.h and expands to the call of:
static inline void native_irq_disable(void)
{
    asm volatile("cli": : :"memory");
}
And you must already remember that the cli instruction clears the IF flag, which determines the ability of a processor to handle an interrupt or an exception. Besides local_irq_disable, as you may already know, there is an inverse macro - local_irq_enable. This macro has the same tracing mechanism and is very similar to local_irq_disable but, as you can understand from its name, it enables interrupts with the sti instruction:
static inline void native_irq_enable(void)
{
    asm volatile("sti": : :"memory");
}
Now we know how local_irq_disable and local_irq_enable work. It was the first call of the local_irq_disable macro, but we will meet these macros many times in the Linux kernel source code. But for now we are in the start_kernel function from the init/main.c and we have just disabled local interrupts. Why local, and why did we do it? Previously the kernel provided a method to disable interrupts on all processors, and it was called cli. This function was removed, and now we have local_irq_{enable,disable} to disable or enable interrupts on the current processor. After we've disabled the interrupts with the local_irq_disable macro, we set:
early_boot_irqs_disabled = true;
The early_boot_irqs_disabled variable is defined in the include/linux/kernel.h:
extern bool early_boot_irqs_disabled;
and is used in different places. For example it is used in the smp_call_function_many function from the kernel/smp.c for checking a possible deadlock when interrupts are disabled:
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
             && !oops_in_progress && !early_boot_irqs_disabled);
Early trap initialization during kernel initialization
The next functions after local_irq_disable are boot_cpu_init and page_address_init, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel initialization process). Next is the setup_arch function. As you may remember, this function is located in the arch/x86/kernel/setup.c source code file and initializes many different architecture-dependent things. The first interrupt-related function we can see in setup_arch is early_trap_init. This function is defined in arch/x86/kernel/traps.c and fills the Interrupt Descriptor Table with a couple of entries:
void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
    load_idt(&idt_descr);
}
Here we can see calls of three different functions:
set_intr_gate_ist
set_system_intr_gate_ist
set_intr_gate
All of these functions are defined in arch/x86/include/asm/desc.h and do similar, but not identical, things. The first, set_intr_gate_ist, inserts a new interrupt gate in the IDT. Let's look at its implementation:
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}
First of all we can see the check that n, which is the vector number of the interrupt, is not greater than 0xff or 255. We need this check because we remember from the previous part that the vector number of an interrupt must be between 0 and 255. In the next step we can see the call of the _set_gate function, which writes the given interrupt gate to the IDT:
static inline void _set_gate(int gate, unsigned type, void *addr,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate_desc s;

    pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
    write_idt_entry(idt_table, gate, &s);
    write_trace_idt_entry(gate, &s);
}
Here we start from the pack_gate function, which takes a clean IDT entry represented by the gate_desc structure and fills it with the handler address (split into three offset fields), the segment selector, the Interrupt Stack Table index, the privilege level and the type of the gate, which can be one of the following values:
GATE_INTERRUPT
GATE_TRAP
GATE_CALL
GATE_TASK
and sets the present bit for the given IDT entry:
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate->offset_low    = PTR_LOW(func);
    gate->segment       = __KERNEL_CS;
    gate->ist           = ist;
    gate->p             = 1;
    gate->dpl           = dpl;
    gate->zero0         = 0;
    gate->zero1         = 0;
    gate->type          = type;
    gate->offset_middle = PTR_MIDDLE(func);
    gate->offset_high   = PTR_HIGH(func);
}
After this we write the just-filled interrupt gate to the IDT with the write_idt_entry macro, which expands to native_write_idt_entry and just copies the interrupt gate to the idt_table table at the given index:
#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
    memcpy(&idt[entry], gate, sizeof(*gate));
}
where idt_table is just an array of gate_desc:
extern gate_desc idt_table[];
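To make the packing step more concrete, here is a minimal user-space sketch of a 64-bit gate descriptor. It is not the kernel's gate_desc: the names model_gate, model_pack_gate and model_gate_offset are hypothetical, and the attribute bits are encoded by hand (IST in bits 0..2, type in bits 8..12, DPL in bits 13..14, present in bit 15) instead of with bit-fields, under the assumption that this matches the layout described in the Intel manuals.

```c
#include <stdint.h>

/* Hypothetical user-space model of a 16-byte IDT gate in 64-bit mode. */
struct model_gate {
    uint16_t offset_low;     /* handler address bits 0..15  */
    uint16_t segment;        /* code segment selector       */
    uint16_t attrs;          /* ist, type, dpl and present  */
    uint16_t offset_middle;  /* handler address bits 16..31 */
    uint32_t offset_high;    /* handler address bits 32..63 */
    uint32_t reserved;       /* must be zero                */
};

enum { MODEL_GATE_INTERRUPT = 0xE };

/* Split the handler address into three parts and build the attribute word. */
static void model_pack_gate(struct model_gate *g, unsigned type, uint64_t func,
                            unsigned dpl, unsigned ist, uint16_t seg)
{
    g->offset_low    = (uint16_t)(func & 0xFFFF);
    g->segment       = seg;
    g->attrs         = (uint16_t)((ist & 0x7) | ((type & 0x1F) << 8) |
                                  ((dpl & 0x3) << 13) | (1u << 15));
    g->offset_middle = (uint16_t)((func >> 16) & 0xFFFF);
    g->offset_high   = (uint32_t)(func >> 32);
    g->reserved      = 0;
}

/* Reassemble the handler address, as the processor does on delivery. */
static uint64_t model_gate_offset(const struct model_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

Packing an interrupt gate with DPL 0 and no IST gives the attribute word 0x8E00, and the same gate with DPL 3 gives 0xEE00; in both cases model_gate_offset returns the original handler address.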
That's all. The second function, set_system_intr_gate_ist, has only one difference from set_intr_gate_ist:
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}
Do you see it? Look at the fourth parameter of _set_gate. It is 0x3. In set_intr_gate_ist it was 0x0. We know that this parameter represents the DPL, or privilege level. We also know that 0 is the highest privilege level and 3 is the lowest. Now we know how set_system_intr_gate_ist, set_intr_gate_ist and set_intr_gate work, and we can return to the early_trap_init function. Let's look at it again:
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
We set two IDT entries, for the #DB exception and for int3. These functions take the same set of parameters:
vector number of an interrupt;
address of an interrupt handler;
interrupt stack table index.
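Why the DPL matters can be shown with a tiny sketch. As far as I understand the architecture, for software-generated interrupts (such as int3) the processor compares the current privilege level with the DPL of the gate and raises a general protection fault when the caller is less privileged than the gate allows; exceptions raised by the processor itself, like #DB, are not subject to this check. The helper name soft_int_allowed is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper: may code running at privilege level `cpl` enter a gate
 * whose descriptor privilege level is `gate_dpl` through a software interrupt
 * instruction such as int3? Numerically larger means less privileged. */
static bool soft_int_allowed(unsigned cpl, unsigned gate_dpl)
{
    return cpl <= gate_dpl;
}
```

With this rule a user program (CPL 3) can reach the int3 gate installed with DPL 3, but it could not reach a gate installed with DPL 0 by executing an int instruction.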
That's all. You will learn more about interrupts and handlers in the next parts.
Conclusion
It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We started from the earliest parts of the Linux kernel source code which are related to interrupts. In the next part we will continue to dive into this interesting theme and will learn more about the interrupt handling process.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
IDT
Protected mode
List of x86 calling conventions
8086
Long mode
NX
Extended Feature Enable Register
Model-specific register
Process identifier
lockdep
irqflags tracing
IF
Stack canary
Union type
this_cpu_* operations
vector number
Interrupt Stack Table
Privilege level
Previous part
Interrupts and Interrupt Handling. Part 3.
Interrupt handlers
This is the third part of the chapter about interrupts and exceptions handling. In the previous part we stopped in the setup_arch function from arch/x86/kernel/setup.c, at the setting of the exception handlers for the following two exceptions:
#DB - debug exception, transfers control from the interrupted process to the debug handler;
#BP - breakpoint exception, caused by the int3 instruction.
These exceptions allow the x86_64 architecture to have early exception processing for the purpose of debugging via kgdb.
As you may remember, we set these exception handlers in the early_trap_init function:
void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
    load_idt(&idt_descr);
}
from arch/x86/kernel/traps.c. We already saw the implementation of the set_intr_gate_ist and set_system_intr_gate_ist functions in the previous part, and now we will look at the implementation of these early exception handlers.
Debug and Breakpoint exceptions
Ok, we set the interrupt gates in the early_trap_init function for the #DB and #BP exceptions, and now it is time to look at their handlers. But first of all let's look at these exceptions. The first exception, #DB or debug exception, occurs when a debug event occurs, for example an attempt to change the contents of a debug register. Debug registers are special registers which are present in processors starting from the Intel 80386 and, as you can understand from the name, they are used for debugging. These registers allow setting breakpoints on code and on reads or writes of data, and thus tracking down the place of errors. The debug registers are privileged resources: a program can access them only in real-address mode or in protected mode at CPL 0. That's why we used set_intr_gate_ist for #DB, and not set_system_intr_gate_ist. The vector number of the #DB exception is 1 (we pass it as X86_TRAP_DB) and it has no error code:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description          |Type |Error Code|Source                               |
----------------------------------------------------------------------------------------------
|1     | #DB    |Reserved             |F/T  |NO        |                                     |
----------------------------------------------------------------------------------------------
The second is #BP or breakpoint exception, which occurs when the processor executes the INT3 instruction. We can add it anywhere in our code; for example let's look at a simple program:
// breakpoint.c
#include <stdio.h>

int main() {
    int i = 0;  /* start counting from zero */
    while (i < 6) {
        printf("i equal to: %d\n", i);
        __asm__("int3");
        ++i;
    }
    return 0;
}
If we compile and run this program, we will see the following output:
$ gcc breakpoint.c -o breakpoint
$ ./breakpoint
i equal to: 0
Trace/breakpoint trap
But if we run it with gdb, we will see our breakpoint and can continue execution of our program:
$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints
i equal to: 0
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 1
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
(gdb) c
Continuing.
i equal to: 2
Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000400585 in main ()
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
...
...
...
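The program above dies outside of gdb because the default action for SIGTRAP is to terminate the process. A rough sketch of the other option, catching the signal ourselves, is shown below. To keep it portable it uses raise(SIGTRAP) as a stand-in for the int3 instruction, which is an assumption of this sketch and not what the kernel path does; the names on_sigtrap and count_traps are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t trap_count;

/* Signal handler: only count the trap and return to the interrupted code. */
static void on_sigtrap(int signo)
{
    (void)signo;
    trap_count++;
}

/* Raise SIGTRAP n times with a handler installed. Returns how many traps
 * were caught, or -1 if the handler could not be installed. */
static int count_traps(int n)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigtrap;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, NULL) != 0)
        return -1;

    trap_count = 0;
    for (int i = 0; i < n; i++)
        raise(SIGTRAP);   /* stands in for __asm__("int3") */
    return (int)trap_count;
}
```

With the handler installed the process is no longer killed: every trap is counted and execution continues after it, which is similar in spirit to typing c in the gdb session above.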
Now we know a little about these two exceptions and we can move on to their handlers.
Preparation before an interrupt handler
As you can note, the set_intr_gate_ist and set_system_intr_gate_ist functions take the address of an exception handler as their second parameter:
&debug;
&int3.
You will not find these functions in the C code. All that can be found in the *.c/*.h files is the declaration of these functions in arch/x86/include/asm/traps.h:
asmlinkage void debug(void);
asmlinkage void int3(void);
But we can see the asmlinkage specifier here. asmlinkage is a special specifier for gcc. For C functions which are called from assembly, we need an explicit declaration of the function calling convention. If a function is marked with asmlinkage, gcc compiles it to retrieve its parameters from the stack; note that this matters on 32-bit x86, where asmlinkage adds the regparm(0) attribute, while on x86_64 the usual register-based calling convention stays in effect. Both handlers are defined in the arch/x86/kernel/entry_64.S assembly source code file with the idtentry macro:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
Actually debug and int3 are not the interrupt handlers themselves. Remember that before we can execute an interrupt/exception handler, some preparation is needed:
When an interrupt or exception occurs, the processor uses the exception or interrupt vector as an index to a descriptor in the IDT;
In legacy mode the ss:esp registers are pushed on the stack only if the privilege level changed. In 64-bit mode ss:rsp is pushed on the stack every time;
During stack switching with IST the new ss selector is forced to null. The old ss and rsp are pushed on the new stack;
The rflags, cs, rip and error code are pushed on the stack;
Control is transferred to an interrupt handler;
After an interrupt handler finishes its work with the iret instruction, the old ss is popped from the stack and loaded into the ss register;
ss:rsp is popped from the stack unconditionally in 64-bit mode, and only if there is a privilege level change in legacy mode;
The iret instruction restores rip, cs and rflags;
The interrupted program continues its execution.
    +--------------------+
+40 |         ss         |
+32 |         rsp        |
+24 |        rflags      |
+16 |         cs         |
+8  |         rip        |
 0  |      error code    |
    +--------------------+
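The frame in the diagram can also be written down as a C structure, which makes the offsets easy to check. This is only an illustrative user-space sketch: the name model_exception_frame is hypothetical, and it assumes the case where an error code has been pushed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the stack at handler entry in 64-bit mode,
 * lowest address first (the stack grows down, so ss was pushed first). */
struct model_exception_frame {
    uint64_t error_code;  /*  +0 */
    uint64_t rip;         /*  +8 */
    uint64_t cs;          /* +16 */
    uint64_t rflags;      /* +24 */
    uint64_t rsp;         /* +32 */
    uint64_t ss;          /* +40 */
};

/* Compile-time checks that the structure matches the diagram. */
_Static_assert(offsetof(struct model_exception_frame, rip) == 8, "rip at +8");
_Static_assert(offsetof(struct model_exception_frame, rflags) == 24, "rflags at +24");
_Static_assert(offsetof(struct model_exception_frame, ss) == 40, "ss at +40");
```

If the structure did not match the diagram, the static assertions would stop the compilation.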
Now we can look at the preparations before a processor transfers control to an interrupt/exception handler from the practical side. As I already wrote above, the first thirteen exception handlers are defined in the arch/x86/kernel/entry_64.S assembly file with the idtentry macro:
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm
This macro defines an exception entry point and, as we can see, it takes five arguments:
sym - defines a global symbol with the .globl name;
do_sym - an interrupt handler;
has_error_code:req - information about the error code. The :req qualifier tells the assembler that the argument is required;
paranoid - shows us how we need to check the current mode;
shift_ist - shows us which stack to use.
As we can see, our exception handlers are almost the same:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
The only differences are in the global name and in the name of the exception handler. Now let's look at how the idtentry macro is implemented. It starts with two checks:
.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif
.if \has_error_code
XCPT_FRAME
.else
INTR_FRAME
.endif
The first check makes sure that an exception which uses the Interrupt Stack Table also has paranoid set; otherwise it emits an error with the .error directive. The second if clause checks for the existence of an error code and calls the XCPT_FRAME or INTR_FRAME macro depending on it. These macros just expand to a set of CFI directives which are used by GNU AS to manage call frames. The CFI directives are used only to generate dwarf2 unwind information for better backtraces and they don't change any code, so we will not go into detail about them, and from this point I will skip all code which is related to these directives. In the next step we check the error code again and, if the exception does not provide one, we push a placeholder on the stack so that the frame has the same layout in both cases:
.ifeq \has_error_code
pushq_cfi $-1
.endif
The pushq_cfi macro is defined in arch/x86/include/asm/dwarf2.h and expands to the pushq instruction, which pushes the given value:
.macro pushq_cfi reg
pushq \reg
CFI_ADJUST_CFA_OFFSET 8
.endm
Pay attention to the $-1. We already know that when an exception occurs, the processor pushes ss, rsp, rflags, cs and rip on the stack:
#define RIP 16*8
#define CS 17*8
#define EFLAGS 18*8
#define RSP 19*8
#define SS 20*8
With the pushq \reg we denote that the place before RIP will contain the error code of an exception:
#define ORIG_RAX 15*8
ORIG_RAX will contain the error code of an exception, the IRQ number on a hardware interrupt and the system call number on system call entry. In the next step we can see the ALLOC_PT_GPREGS_ON_STACK macro, which allocates space for the 15 general purpose registers on the stack:
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
subq $15*8+\addskip, %rsp
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
.endm
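The constants above are byte offsets into the register save area, so they can be cross-checked with a structure that lists the registers in the same order. The sketch below is a user-space model, not the kernel header: model_pt_regs is a hypothetical name and the field order is written from memory of the x86_64 pt_regs layout, so treat it as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the x86_64 register save area, lowest address first. */
struct model_pt_regs {
    /* 15 general purpose registers: the space that
     * ALLOC_PT_GPREGS_ON_STACK reserves (15*8 bytes) */
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
    /* error code, IRQ number or system call number */
    uint64_t orig_ax;
    /* pushed by the processor */
    uint64_t ip, cs, flags, sp, ss;
};

/* The offsets must agree with the ORIG_RAX, RIP and SS constants. */
_Static_assert(offsetof(struct model_pt_regs, orig_ax) == 15 * 8, "ORIG_RAX");
_Static_assert(offsetof(struct model_pt_regs, ip) == 16 * 8, "RIP");
_Static_assert(offsetof(struct model_pt_regs, ss) == 20 * 8, "SS");
```

Under this layout, subtracting 15*8 from rsp right after the error code slot makes rsp point at the start of the structure.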
After this we check paranoid, and if it is set we check the two low bits of the saved cs, which hold the privilege level. We test them against 3, which lets us know whether we came from userspace or not:
.if \paranoid
.if \paranoid == 1
CFI_REMEMBER_STATE
testl $3, CS(%rsp)
jnz 1f
.endif
call paranoid_entry
.else
call error_entry
.endif
If we came from userspace we jump to the label 1, which starts with the call error_entry instruction. error_entry saves all registers in the pt_regs structure, which represents an interrupt/exception stack frame and is defined in arch/x86/include/uapi/asm/ptrace.h. It saves the common and extra registers on the stack with:
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
from rdi to r15 and executes the swapgs instruction. This instruction provides a method for the Linux kernel to obtain a pointer to the kernel data structures and save the user's gsbase. After this we exit from error_entry with the ret instruction. After error_entry has finished, since we came from userspace, we need to switch to the kernel interrupt stack:
movq %rsp,%rdi
call sync_regs
We have just saved all registers in the pt_regs structure in error_entry; now we put the address of pt_regs into rdi and call the sync_regs function from arch/x86/kernel/traps.c:
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
{
    struct pt_regs *regs = task_pt_regs(current);
    *regs = *eregs;
    return regs;
}
This function switches off the IST stack if we came from usermode. After this we switch to the stack which we got from sync_regs:
movq %rax,%rsp
movq %rsp,%rdi
and put the pointer to pt_regs into rdi again. In the last step we call the exception handler:
call \do_sym
So, the real exception handlers are the do_debug and do_int3 functions. We will see these functions in this part, but a little later. First of all let's finish with the preparations before a processor transfers control to an interrupt handler. In the other case, when paranoid is set and the check above did not send us to the label 1 (we did not come from userspace), we call paranoid_entry, which does almost the same as error_entry, but checks the current mode in a slower but more accurate way:
ENTRY(paranoid_entry)
    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8
    ...
    ...
    movl $MSR_GS_BASE,%ecx
    rdmsr
    testl %edx,%edx
    js 1f /* negative -> in kernel */
    SWAPGS
    ...
    ...
    ret
END(paranoid_entry)
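Both mode checks described in this section, the fast one on the saved cs and the slow one on MSR_GS_BASE, can be modelled with two small predicates. This is a user-space sketch with hypothetical names (cs_is_user, gs_base_is_kernel); the selector values 0x10 and 0x33 mentioned in the note after the code are my assumption about typical x86_64 kernel and user code selectors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fast check, like "testl $3, CS(%rsp)": the two low bits of the saved
 * cs selector hold the privilege level, non-zero means userspace. */
static bool cs_is_user(uint64_t saved_cs)
{
    return (saved_cs & 3) != 0;
}

/* Slow check, like "rdmsr; testl %edx,%edx; js": rdmsr returns the high
 * half of MSR_GS_BASE in edx, so its sign bit is bit 63 of the base.
 * Kernel addresses live in the upper half of the address space. */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    return ((gs_base >> 63) & 1) != 0;
}
```

For example, cs_is_user(0x33) is true and cs_is_user(0x10) is false, while a gs base such as 0xffff88007fc00000 is recognised as a kernel one and swapgs is skipped.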
If edx is negative, we are in kernel mode. Once all registers are stored on the stack and we have checked that we are in kernel mode, we need to set up the IST stack if it is set for the given exception, call the exception handler and restore the exception stack:
.if \shift_ist != -1
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
call \do_sym
.if \shift_ist != -1
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
In the last step, when the exception handler has finished its work, all registers are restored from the stack with the RESTORE_C_REGS and RESTORE_EXTRA_REGS macros and control is returned to the interrupted task. That's all. Now we know about the preparation before an interrupt/exception handler starts to execute and we can go directly to the implementation of the handlers.
Implementation of interrupt and exception handlers
Both handlers, do_debug and do_int3, are defined in the arch/x86/kernel/traps.c source code file and have two things in common. All interrupt/exception handlers are marked with the dotraplinkage prefix, which expands to:
#define dotraplinkage __visible
#define __visible __attribute__((externally_visible))
which tells the compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). They also take two parameters:
pointer to the pt_regs structure, which contains the registers of the interrupted task;
error code.
First of all let's consider the do_debug handler. This function starts by getting the previous state with the ist_enter function from arch/x86/kernel/traps.c. We call it because we need to know whether we came to the interrupt handler from kernel mode or user mode.
prev_state = ist_enter(regs);
The ist_enter function returns the previous context state and executes a couple of preparations before we continue to handle an exception. It starts with a check of the previous mode with the user_mode_vm macro. It takes the pt_regs structure, which contains a set of registers of the interrupted task, and returns 1 if we came from userspace and 0 if we came from kernel space. According to the previous mode we execute exception_enter if we came from userspace, or inform RCU if we came from kernel space:
...
if (user_mode_vm(regs)) {
    prev_state = exception_enter();
} else {
    rcu_nmi_enter();
    prev_state = IN_KERNEL;
}
...
...
...
return prev_state;
After this we load the DR6 debug register into the dr6 variable with a call of the get_debugreg macro from arch/x86/include/asm/debugreg.h:
get_debugreg(dr6, 6);
dr6 &= ~DR6_RESERVED;
The DR6 debug register is the debug status register, which contains information about the reason why the #DB or debug exception occurred. After we load its value into the dr6 variable we filter out all reserved bits (4:12 bits). In the next step we check the dr6 register and the previous state with the following if condition:
if (!dr6 && user_mode_vm(regs))
    user_icebp = 1;
If dr6 does not show any reason why we caught this trap, we set user_icebp to one, which means that user code wants to get a SIGTRAP signal. In the next step we check whether it was a kmemcheck trap and if so we go to exit:
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
    goto exit;
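The filtering of dr6 and the user_icebp decision described above can be sketched in user space as follows. The mask values are my recollection of the constants in the kernel's debugreg.h (reserved mask 0xFFFF0FF0, single-step bit 0x4000, breakpoint bits 0xF) and should be treated as assumptions; all model_ names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_DR6_RESERVED 0xFFFF0FF0UL  /* bits that carry no information */
#define MODEL_DR_TRAP_BITS 0xFUL         /* breakpoints 0..3 hit           */
#define MODEL_DR_STEP      0x4000UL      /* single-step trap               */

/* Drop the reserved bits, like "dr6 &= ~DR6_RESERVED". */
static unsigned long model_filter_dr6(unsigned long dr6)
{
    return dr6 & ~MODEL_DR6_RESERVED;
}

/* No recorded reason and a trap from user mode: treat it as a user icebp. */
static bool model_user_icebp(unsigned long raw_dr6, bool from_user)
{
    return model_filter_dr6(raw_dr6) == 0 && from_user;
}
```

A raw value that has only reserved bits set filters down to zero, so a trap from user mode with such a value is treated as an icebp; as soon as the single-step bit or one of the breakpoint bits survives the filter, the reason is known and user_icebp stays zero.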
After we did all these checks, we clear the dr6 register, clear the TIF_BLOCKSTEP thread flag that corresponds to DEBUGCTLMSR_BTF (which provides single-step on branches debugging), save dr6 for the current thread and increase the debug_stack_usage per-cpu variable with:
set_debugreg(0, 6);
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
tsk->thread.debugreg6 = dr6;
debug_stack_usage_inc();
As we saved dr6, we can allow irqs:
static inline void preempt_conditional_sti(struct pt_regs *regs)
{
    preempt_count_inc();
    if (regs->flags & X86_EFLAGS_IF)
        local_irq_enable();
}
You can read more about local_irq_enable and related stuff in the second part about interrupts handling in the Linux kernel. In the next step we check whether the previous mode was virtual 8086 and handle the trap:
if (regs->flags & X86_VM_MASK) {
    handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
    preempt_conditional_cli(regs);
    debug_stack_usage_dec();
    goto exit;
}
...
...
...
exit:
    ist_exit(regs, prev_state);
If we did not come from virtual 8086 mode, we need to check the dr6 register and the previous mode as we did above. Here we check whether a single-step trap is reported and we did not come from user mode; if so, we clear the single-step bit in the dr6 copy in the current thread, set the TIF_SINGLESTEP flag and clear the Trap flag in the saved flags:
if ((dr6 & DR_STEP) && !user_mode(regs)) {
    tsk->thread.debugreg6 &= ~DR_STEP;
    set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
    regs->flags &= ~X86_EFLAGS_TF;
}
Then we get the SIGTRAP signal code:
si_code = get_si_code(tsk->thread.debugreg6);
and send it for user icebp traps:
if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
    send_sigtrap(tsk, regs, error_code, si_code);
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
    ist_exit(regs, prev_state);
In the end we disable irqs, decrement the value of debug_stack_usage and exit from the exception handler with the ist_exit function.
The second exception handler is do_int3, defined in the same source code file - arch/x86/kernel/traps.c. In do_int3 we do almost the same as in the do_debug handler. We get the previous state with ist_enter, increment and decrement the debug_stack_usage per-cpu variable, enable and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and then sync processor cores during breakpoint patching.
That's all.
Conclusion
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Interrupt Descriptor Table in the previous part with the #DB and #BP gates, and in this part we started to dive into the preparation before control is transferred to an exception handler and into the implementation of some interrupt handlers. In the next part we will continue to dive into this theme, go further through the setup_arch function and try to understand interrupt handling related stuff.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
Debug registers
Intel 80386
INT 3
gcc
TSS
GNU assembly .error directive
dwarf2
CFI directives
IRQ
system call
swapgs
SIGTRAP
Per-CPU variables
kgdb
ACPI
Previous part
Interrupts and Interrupt Handling. Part 4.
Initialization of non-early interrupt gates
This is the fourth part about interrupts and exceptions handling in the Linux kernel. In the previous part we saw the first early #DB and #BP exception handlers from arch/x86/kernel/traps.c. We stopped right after the early_trap_init function, which is called in the setup_arch function defined in arch/x86/kernel/setup.c. In this part we will continue to dive into interrupts and exceptions handling in the Linux kernel for x86_64, from the place where we left off in the last part. The first thing related to interrupts and exceptions handling is the setup of the #PF or page fault handler with the early_trap_pf_init function. Let's start from it.
Early page fault handler
The early_trap_pf_init function is defined in arch/x86/kernel/traps.c. It uses the set_intr_gate macro, which fills the Interrupt Descriptor Table with the given entry:
void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
}
This macro is defined in arch/x86/include/asm/desc.h. We already saw similar helpers in the previous part - set_system_intr_gate_ist and set_intr_gate_ist. This macro checks that the given vector number is not greater than 255 (the maximum vector number) and calls the _set_gate function, as set_system_intr_gate_ist and set_intr_gate_ist did:
#define set_intr_gate(n, addr)                                  \
do {                                                            \
    BUG_ON((unsigned)n > 0xFF);                                 \
    _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,            \
              __KERNEL_CS);                                     \
    _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,    \
                    0, 0, __KERNEL_CS);                         \
} while (0)
The set_intr_gate macro takes two parameters:
vector number of an interrupt;
address of an interrupt handler.
In our case they are:
X86_TRAP_PF - 14;
page_fault - the interrupt handler entry point.
X86_TRAP_PF is an element of the enum defined in arch/x86/include/asm/traps.h:
enum {
    ...
    ...
    ...
    ...
    X86_TRAP_PF, /* 14, Page Fault */
    ...
    ...
    ...
}
When early_trap_pf_init is called, set_intr_gate expands to a call of _set_gate, which fills the IDT with the handler for the page fault. Now let's look at the implementation of the page_fault handler. The page_fault handler is defined in the arch/x86/kernel/entry_64.S assembly source code file, like all exception handlers. Let's look at it:
trace_idtentry page_fault do_page_fault has_error_code=1
We saw in the previous part how the #DB and #BP handlers are defined. They were defined with the idtentry macro, but here we can see trace_idtentry. This macro is defined in the same source code file and depends on the CONFIG_TRACING kernel configuration option:
#ifdef CONFIG_TRACING
.macro trace_idtentry sym do_sym has_error_code:req
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#else
.macro trace_idtentry sym do_sym has_error_code:req
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#endif
We will not dive into exceptions tracing now. If CONFIG_TRACING is not set, we can see that the trace_idtentry macro just expands to the normal idtentry. We already saw the implementation of the idtentry macro in the previous part, so let's start from the page_fault exception handler.
As we can see in the idtentry definition, the handler of page_fault is the do_page_fault function, which is defined in arch/x86/mm/fault.c and, like all exception handlers, takes two arguments:
regs - pt_regs structure that holds the state of an interrupted process;
error_code - error code of the page fault exception.
Let's look inside this function. First of all we read the content of the cr2 control register:
dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();
    ...
    ...
    ...
}
This register contains the linear address which caused the page fault. In the next step we make a call of the exception_enter function from include/linux/context_tracking.h. exception_enter and exception_exit are functions from the context tracking subsystem in the Linux kernel, used by RCU to remove its dependency on the timer tick while a processor runs in userspace. In almost every exception handler we will see similar code:
enum ctx_state prev_state;
prev_state = exception_enter();
...
... // exception handler here
...
exception_exit(prev_state);
The exception_enter function checks that context tracking is enabled with context_tracking_is_enabled, and if it is enabled, we get the previous context with this_cpu_read (you can read more about this_cpu_* operations in the Documentation). After this it calls the context_tracking_user_exit function, which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:
static inline enum ctx_state exception_enter(void)
{
    enum ctx_state prev_ctx;

    if (!context_tracking_is_enabled())
        return 0;

    prev_ctx = this_cpu_read(context_tracking.state);
    context_tracking_user_exit();

    return prev_ctx;
}
The state can be one of:
enum ctx_state {
    IN_KERNEL = 0,
    IN_USER,
} state;
And in the end we return the previous context. Between exception_enter and exception_exit we call the actual page fault handler:
__do_page_fault(regs, error_code, address);
__do_page_fault is defined in the same source code file as do_page_fault - arch/x86/mm/fault.c. In the beginning of __do_page_fault we check the state of the kmemcheck checker. kmemcheck detects and warns about some uses of uninitialized memory. We need to check it because a page fault can be caused by kmemcheck:
if (kmemcheck_active(regs))
    kmemcheck_hide(regs);
prefetchw(&mm->mmap_sem);
After this we can see the call of prefetchw, which executes the instruction with the same name (on processors that report the X86_FEATURE_3DNOW feature) to fetch the cache line for exclusive access. The main purpose of prefetching is to hide the latency of a memory access. In the next step we check whether the faulting address is in kernel space with the following condition:
if (unlikely(fault_in_kernel_space(address))) {
    ...
    ...
    ...
}
where fault_in_kernel_space is:
static int fault_in_kernel_space(unsigned long address)
{
    return address >= TASK_SIZE_MAX;
}
The TASK_SIZE_MAX macro expands to:
#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
or 0x00007ffffffff000. Pay attention to the unlikely macro. There are two such macros in the Linux kernel:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
You can often find these macros in the code of the Linux kernel. The main purpose of these macros is optimization. Sometimes we need to check a condition in the code and we know that it will rarely be true or false. With these macros we can tell the compiler about this. For example:
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
    if (ctx->pos < FIRST_PROCESS_ENTRY) {
        int error = proc_readdir(file, ctx);
        if (unlikely(error <= 0))
            return error;
    ...
    ...
    ...
}
Here we can see the proc_root_readdir function, which is called when the Linux VFS needs to read the root directory contents. If a condition is marked with unlikely, the compiler can put the code for the expected (false) case right after the branch. Now let's go back to our address check. The comparison between the given address and 0x00007ffffffff000 tells us whether the faulting address belongs to kernel space or to user space. After this check, the __do_page_fault routine tries to understand the problem that provoked the page fault exception and then passes the address to the appropriate routine. It can be a kmemcheck fault, a spurious fault, a kprobes fault and so on. We will not dive into the implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but we will see it in the chapter about memory management in the Linux kernel.
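The address check and the unlikely hint from this section can be tried in user space with a small sketch. It assumes a 4096-byte page and a compiler that provides __builtin_expect (gcc or clang); the model_ names are hypothetical and only mirror the kernel ones.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE      4096ULL
#define MODEL_TASK_SIZE_MAX  ((1ULL << 47) - MODEL_PAGE_SIZE)
#define model_unlikely(x)    __builtin_expect(!!(x), 0)

/* Same comparison as fault_in_kernel_space, on a 64-bit value. */
static int model_fault_in_kernel_space(uint64_t address)
{
    return address >= MODEL_TASK_SIZE_MAX;
}

/* The hint does not change the result, only the expected branch layout. */
static const char *model_classify_fault(uint64_t address)
{
    if (model_unlikely(model_fault_in_kernel_space(address)))
        return "kernel";
    return "user";
}
```

With these definitions MODEL_TASK_SIZE_MAX is 0x00007ffffffff000, so an address like 0x400000 is classified as a user one and 0xffffffff81000000 as a kernel one.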
Back to start_kernel
There are many different function calls after early_trap_pf_init in the setup_arch function from different kernel subsystems, but none of them is related to interrupts and exceptions handling. So, we have to go back to where we came from - the start_kernel function from init/main.c. The first thing after setup_arch is the trap_init function from arch/x86/kernel/traps.c. This function initializes the remaining exception handlers (remember that we have already set up 3 handlers, for the #DB - debug exception, #BP - breakpoint exception and #PF - page fault exception). The trap_init function starts with a check of the Extended Industry Standard Architecture:
#ifdef CONFIG_EISA
    void __iomem *p = early_ioremap(0x0FFFD9, 4);

    if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
        EISA_bus = 1;
    early_iounmap(p, 4);
#endif
Note that it depends on the CONFIG_EISA kernel configuration parameter, which represents EISA support. Here we use the early_ioremap function to map I/O memory in the page tables. We use the readl function to read the first 4 bytes from the mapped region and, if they are equal to the EISA string, we set EISA_bus to one. In the end we just unmap the previously mapped region. You can read more about early_ioremap in the part which describes Fix-Mapped Addresses and ioremap.
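The comparison in the EISA check works because x86 is little-endian: the four bytes 'E', 'I', 'S', 'A' in memory read back as one 32-bit value with 'E' in the lowest byte. A minimal sketch of the same comparison on an ordinary buffer is shown below; model_readl_le and model_has_eisa_signature are hypothetical helpers, and the little-endian load is assembled by hand so the sketch does not depend on the host byte order.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value from four bytes in memory. */
static uint32_t model_readl_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Same expression as in trap_init, applied to a plain buffer. */
static int model_has_eisa_signature(const unsigned char *p)
{
    return model_readl_le(p) ==
           (uint32_t)('E' + ('I' << 8) + ('S' << 16) + ('A' << 24));
}
```

A buffer that starts with the bytes of "EISA" matches, while the same letters in another order do not.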
After this we start to fill the Interrupt Descriptor Table with the different interrupt gates. First of all we set #DE or Divide Error and #NMI or Non-maskable Interrupt:
set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
We use the set_intr_gate macro to set the interrupt gate for the #DE exception and set_intr_gate_ist for #NMI. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, debug handler and so on; you can find an explanation of them in the previous part. After this we set up exception gates for the following exceptions:
set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
set_intr_gate(X86_TRAP_NM, device_not_available);
Here we can see:
#OF or Overflow exception. This exception indicates that an overflow trap occurred when the special INTO instruction was executed;
#BR or BOUND Range exceeded exception. This exception indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed;
#UD or Invalid Opcode exception. Occurs when a processor attempted to execute an invalid or reserved opcode, an instruction with invalid operand(s) and so on;
#NM or Device Not Available exception. Occurs when the processor tries to execute an x87 FPU floating point instruction while the EM flag in the control register cr0 was set.
In the next step we set the interrupt gate for the #DF or Double fault exception:
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. Usually, when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or #DF exception.
The following set of interrupt gates is:
set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
set_intr_gate(X86_TRAP_NP, &segment_not_present);
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
set_intr_gate(X86_TRAP_GP, &general_protection);
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
set_intr_gate(X86_TRAP_AC, &alignment_check);
Here we can see the setup of the following exception handlers:
#CSO or Coprocessor Segment Overrun - this exception indicates that the math coprocessor of an old processor detected a page or segment violation. Modern processors do not generate this exception;
#TS or Invalid TSS exception - indicates that there was an error related to the Task State Segment;
#NP or Segment Not Present exception - indicates that the present flag of a segment or gate descriptor is clear during an attempt to load one of the cs, ds, es, fs, or gs registers;
#SS or Stack Fault exception - indicates that one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the ss register;
#GP or General Protection exception - indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause a general-protection exception. For example loading the ss, ds, es, fs, or gs register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the Interrupt Descriptor Table (following an interrupt or exception) that is not an interrupt, trap, or task gate, and many more;
Spurious Interrupt - a hardware interrupt that is unwanted;
#MF or x87 FPU Floating-Point Error exception - caused when the x87 FPU has detected a floating point error;
#AC or Alignment Check exception - indicates that the processor detected an unaligned memory operand when alignment checking was enabled.
After we have set up these exception gates, we can see the setup of the Machine-Check exception:
#ifdef CONFIG_X86_MCE
    set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif
Note that it depends on the CONFIG_X86_MCE kernel configuration option. The Machine-Check exception indicates that the processor detected an internal machine error or a bus error, or that an external agent detected a bus error. The next exception gate is for the SIMD Floating-Point exception:
set_intr_gate(X86_TRAP_XF,&simd_coprocessor_error);
which indicates that the processor has detected an SSE, SSE2 or SSE3 SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing a SIMD floating-point instruction:
Invalid operation
Divide-by-zero
Denormal operand
Numeric overflow
Numeric underflow
Inexact result (Precision)
In the next step we mark the first 32 interrupt vectors as used in the used_vectors array, which is defined in the arch/x86/include/asm/desc.h header file and represents a bitmap:
DECLARE_BITMAP(used_vectors, NR_VECTORS);
(you can read more about bitmaps in the Linux kernel in the part which describes cpumasks and bitmaps):
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);
where FIRST_EXTERNAL_VECTOR is:
#define FIRST_EXTERNAL_VECTOR 0x20
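The bitmap bookkeeping described above can be sketched in user space. This is only an illustration, not the kernel implementation: sketch_set_bit, sketch_test_bit and reserve_exception_vectors are hypothetical names, and the array is a plain stand-in for what DECLARE_BITMAP produces.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_VECTORS            256   /* number of IDT entries on x86 */
#define FIRST_EXTERNAL_VECTOR 0x20

#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS(n) (((n) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* User-space stand-in for DECLARE_BITMAP(used_vectors, NR_VECTORS). */
static unsigned long used_vectors[BITMAP_WORDS(NR_VECTORS)];

/* Hypothetical helper: mark vector nr as used. */
void sketch_set_bit(unsigned int nr, unsigned long *map)
{
    assert(nr < NR_VECTORS);
    map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Hypothetical helper: return 1 if vector nr is marked as used. */
int sketch_test_bit(unsigned int nr, const unsigned long *map)
{
    assert(nr < NR_VECTORS);
    return (int)((map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}

/* Mark vectors 0..31 (the architecture-defined exceptions) as used. */
void reserve_exception_vectors(void)
{
    unsigned int i;

    for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
        sketch_set_bit(i, used_vectors);
}
```

After reserve_exception_vectors() runs, vectors 0 to 31 test as used and vector 32 (the first external vector) is still free.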
After this we set up the interrupt gate for ia32_syscall and add 0x80 to the used_vectors bitmap:
#ifdef CONFIG_IA32_EMULATION
    set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
    set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
There is the CONFIG_IA32_EMULATION kernel configuration option on x86_64 Linux kernels. This option provides the ability to execute 32-bit processes in compatibility mode. In the next parts we will see how it works; in the meantime we only need to know that there is yet another interrupt gate in the IDT with the vector number 0x80. In the next step we map the IDT to the fixmap area:
__set_fixmap(FIX_RO_IDT,__pa_symbol(idt_table),PAGE_KERNEL_RO);
idt_descr.address=fix_to_virt(FIX_RO_IDT);
and write its address to idt_descr.address (you can read more about fix-mapped addresses in the second part of the Linux kernel memory management chapter). After this we can see the call of the cpu_init function that is defined in arch/x86/kernel/cpu/common.c. This function initializes all of the per-cpu state. At the beginning of cpu_init we do the following things: first of all we wait while the current cpu is initialized, then we call the cr4_init_shadow function, which stores a shadow copy of the cr4 control register for the current cpu, and load the CPU microcode if needed, with the following function calls:
wait_for_master_cpu(cpu);
cr4_init_shadow();
load_ucode_ap();
Next we get the Task State Segment for the current cpu and the orig_ist structure, which represents the original Interrupt Stack Table values, with:
t = &per_cpu(cpu_tss, cpu);
oist = &per_cpu(orig_ist, cpu);
As we have got the Task State Segment and the Interrupt Stack Table for the current processor, we clear the following bits in the cr4 control register:
cr4_clear_bits(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE);
With this we disable the vm86 extension, virtual interrupts, the time stamp disable flag (while this flag is set, RDTSC can only be executed with the highest privilege) and the debug extension. After this we reload the Global Descriptor Table and the Interrupt Descriptor Table with:
switch_to_new_gdt(cpu);
loadsegment(fs,0);
load_current_idt();
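The cr4 bit clearing mentioned above comes down to masking four low bits of the register value. The following is a minimal sketch on a plain integer, assuming the architectural bit positions of these flags (VME is bit 0, PVI bit 1, TSD bit 2, DE bit 3). sketch_cr4_clear_bits is a hypothetical helper and does not touch the real control register.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural positions of the four flags cleared in cpu_init. */
#define X86_CR4_VME (1UL << 0)  /* virtual-8086 mode extensions  */
#define X86_CR4_PVI (1UL << 1)  /* protected-mode virtual ints   */
#define X86_CR4_TSD (1UL << 2)  /* time stamp disable            */
#define X86_CR4_DE  (1UL << 3)  /* debugging extensions          */

/* Hypothetical helper: return cr4 with every bit in mask cleared. */
unsigned long sketch_cr4_clear_bits(unsigned long cr4, unsigned long mask)
{
    return cr4 & ~mask;
}
```

For example, clearing these four flags in the value 0x2f leaves 0x20: only the bits above bit 3 survive.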
After this we set up the array of Thread-Local Storage descriptors, configure NX and load the CPU microcode. Now it is time to set up and load the per-cpu Task State Segments. We go in a loop through all exception stacks, whose number is N_EXCEPTION_STACKS or 4, and fill the Interrupt Stack Table with them:
if (!oist->ist[0]) {
    char *estacks = per_cpu(exception_stacks, cpu);
    for (v = 0; v < N_EXCEPTION_STACKS; v++) {
        estacks += exception_stack_sizes[v];
        oist->ist[v] = t->x86_tss.ist[v] =
            (unsigned long)estacks;
        if (v == DEBUG_STACK-1)
            per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
    }
}
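Note that the loop above advances the estacks pointer before storing it, so each IST entry holds the address just past the end of its stack, which is where a downward-growing stack starts. A small user-space sketch of the same arithmetic follows. The sizes in sketch_stack_sizes are hypothetical values chosen for illustration, and fill_ist is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_EXCEPTION_STACKS 4

/* Hypothetical sizes; the kernel reads them from exception_stack_sizes[]. */
static const size_t sketch_stack_sizes[N_EXCEPTION_STACKS] = {
    4096, 4096, 8192, 4096
};

/*
 * Walk a contiguous area that starts at base and record, for every
 * exception stack, the address just past its end (its initial top).
 */
void fill_ist(uintptr_t base, uintptr_t ist[N_EXCEPTION_STACKS])
{
    uintptr_t estacks = base;
    int v;

    assert(ist != NULL);
    for (v = 0; v < N_EXCEPTION_STACKS; v++) {
        estacks += sketch_stack_sizes[v];  /* move past stack v first */
        ist[v] = estacks;                  /* then store its top      */
    }
}
```

With a base of 0x10000 the four entries become 0x11000, 0x12000, 0x14000 and 0x15000.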
As we have filled the Task State Segments with the Interrupt Stack Tables, we can set the TSS descriptor for the current processor and load it with:
set_tss_desc(cpu, t);
load_TR_desc();
where the set_tss_desc macro from arch/x86/include/asm/desc.h writes the given descriptor to the Global Descriptor Table of the given processor:
#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)
static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
{
    struct desc_struct *d = get_cpu_gdt_table(cpu);
    tss_desc tss;
    set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
                          IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
                          sizeof(unsigned long) - 1);
    write_gdt_entry(d, entry, &tss, DESC_TSS);
}
and the load_TR_desc macro expands to the ltr or Load Task Register instruction:
#define load_TR_desc() native_load_tr_desc()
static inline void native_load_tr_desc(void)
{
    asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
At the end of the trap_init function we can see the following code:
set_intr_gate_ist(X86_TRAP_DB,&debug,DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP,&int3,DEBUG_STACK);
...
...
...
#ifdef CONFIG_X86_64
    memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
    set_nmi_gate(X86_TRAP_DB, &debug);
    set_nmi_gate(X86_TRAP_BP, &int3);
#endif
Here we copy idt_table to nmi_idt_table and set up exception handlers for the #DB or Debug exception and the #BP or Breakpoint exception. You may remember that we already set these interrupt gates in the previous part, so why do we need to set them up again? We set them up again because when we initialized them before in the early_trap_init function, the Task State Segment was not ready yet, but now it is ready after the call of the cpu_init function.
That's all. Soon we will consider all handlers of these interrupts/exceptions.
Conclusion
It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Task State Segment in this part and the initialization of different interrupt handlers such as the Divide Error, the Page Fault exception, etc. Note that we saw just the initialization stuff, and we will dive into details about the handlers for these exceptions later. In the next part we will start to do it.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
page fault
Interrupt Descriptor Table
Tracing
cr2
RCU
this_cpu_* operations
kmemcheck
prefetchw
3DNow
CPU caches
VFS
Linux kernel memory management
Fix-Mapped Addresses and ioremap
Extended Industry Standard Architecture
INT instruction
INTO
BOUND
opcode
control register
x87 FPU
MCE exception
SIMD
cpumasks and bitmaps
NX
Task State Segment
Previous part
Interrupts and Interrupt Handling. Part 5.
Implementation of exception handlers
This is the fifth part about interrupts and exceptions handling in the Linux kernel. In the previous part we stopped at the setting of interrupt gates in the Interrupt Descriptor Table. We did it in the trap_init function from the arch/x86/kernel/traps.c source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed lives in the arch/x86/entry/entry_64.S assembly file and occurs in the idtentry macro that defines exception entry points:
idtentry divide_error                   do_divide_error                 has_error_code=0
idtentry overflow                       do_overflow                     has_error_code=0
idtentry invalid_op                     do_invalid_op                   has_error_code=0
idtentry bounds                         do_bounds                       has_error_code=0
idtentry device_not_available           do_device_not_available         has_error_code=0
idtentry coprocessor_segment_overrun    do_coprocessor_segment_overrun  has_error_code=0
idtentry invalid_TSS                    do_invalid_TSS                  has_error_code=1
idtentry segment_not_present            do_segment_not_present          has_error_code=1
idtentry spurious_interrupt_bug         do_spurious_interrupt_bug       has_error_code=0
idtentry coprocessor_error              do_coprocessor_error            has_error_code=0
idtentry alignment_check                do_alignment_check              has_error_code=1
idtentry simd_coprocessor_error         do_simd_coprocessor_error       has_error_code=0
The idtentry macro does the following preparation before an actual exception handler (do_divide_error for divide_error, do_overflow for overflow, etc.) gets control. In other words, the idtentry macro allocates space for the registers (the pt_regs structure) on the stack, pushes a dummy error code for stack consistency if an interrupt/exception has no error code, checks the segment selector in the cs segment register and switches depending on the previous state (userspace or kernelspace). After all of these preparations it makes a call to the actual interrupt/exception handler:
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    ...
    ...
    ...
    call \do_sym
    ...
    ...
    ...
END(\sym)
.endm
After an exception handler finishes its work, the idtentry macro restores the stack and the general purpose registers of the interrupted task and executes the iret instruction:
ENTRY(paranoid_exit)
    ...
    ...
    ...
    RESTORE_EXTRA_REGS
    RESTORE_C_REGS
    REMOVE_PT_GPREGS_FROM_STACK 8
    INTERRUPT_RETURN
END(paranoid_exit)
where INTERRUPT_RETURN is:
#define INTERRUPT_RETURN jmp native_iret
...
ENTRY(native_iret)
.global native_irq_return_iret
native_irq_return_iret:
    iretq
You can read more about the idtentry macro in the third part of the http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html chapter. Ok, now we have seen the preparation before an exception handler is executed, and it is time to look at the handlers. First of all let's look at the following handlers:
divide_error
overflow
invalid_op
coprocessor_segment_overrun
invalid_TSS
segment_not_present
stack_segment
alignment_check
All these handlers are defined in the arch/x86/kernel/traps.c source code file with the DO_ERROR macro:
DO_ERROR(X86_TRAP_DE,     SIGFPE,  "divide error",                divide_error)
DO_ERROR(X86_TRAP_OF,     SIGSEGV, "overflow",                    overflow)
DO_ERROR(X86_TRAP_UD,     SIGILL,  "invalid opcode",              invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE,  "coprocessor segment overrun", coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS,     SIGSEGV, "invalid TSS",                 invalid_TSS)
DO_ERROR(X86_TRAP_NP,     SIGBUS,  "segment not present",         segment_not_present)
DO_ERROR(X86_TRAP_SS,     SIGBUS,  "stack segment",               stack_segment)
DO_ERROR(X86_TRAP_AC,     SIGBUS,  "alignment check",             alignment_check)
As we can see the DO_ERROR macro takes 4 parameters:
Vector number of an interrupt;
Signal number which will be sent to the interrupted process;
String which describes an exception;
Exception handler entry point.
This macro is defined in the same source code file and expands to a function with the do_handler name:
#define DO_ERROR(trapnr, signr, str, name)                              \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code)     \
{                                                                       \
    do_error_trap(regs, error_code, str, trapnr, signr);                \
}
Note the ## tokens. This is a special feature - GCC macro Concatenation - which concatenates two given tokens. For example, the first DO_ERROR in our example will expand to:
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code)
{
    ...
}
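The token concatenation used by DO_ERROR can be tried out in ordinary user-space C. The sketch below is only an illustration of the ## operator: SKETCH_DO_ERROR, sketch_error_trap, last_trap and last_trap_is are hypothetical names, the generated functions take no arguments, and instead of delivering a signal they just record what they were asked to report.

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/* What the last generated handler reported (illustration only). */
struct trap_record {
    int trapnr;
    int signr;
    const char *str;
};

static struct trap_record last_trap;

/* Hypothetical stand-in for do_error_trap: just remember the arguments. */
void sketch_error_trap(const char *str, int trapnr, int signr)
{
    assert(str != NULL);
    last_trap.trapnr = trapnr;
    last_trap.signr = signr;
    last_trap.str = str;
}

/* Return 1 if the last recorded description equals str. */
int last_trap_is(const char *str)
{
    return last_trap.str != NULL && strcmp(last_trap.str, str) == 0;
}

/* do_##name pastes "do_" and the macro argument into one identifier. */
#define SKETCH_DO_ERROR(trapnr, signr, str, name)   \
void do_##name(void)                                \
{                                                   \
    sketch_error_trap(str, trapnr, signr);          \
}

/* Vector 0 is #DE and vector 6 is #UD on x86. */
SKETCH_DO_ERROR(0, SIGFPE, "divide error", divide_error)
SKETCH_DO_ERROR(6, SIGILL, "invalid opcode", invalid_op)
```

The two macro invocations define the functions do_divide_error and do_invalid_op, so the handler name never has to be typed twice.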
We can see that all functions which are generated by the DO_ERROR macro just make a call to the do_error_trap function from arch/x86/kernel/traps.c. Let's look at the implementation of the do_error_trap function.
Trap handlers
The do_error_trap function starts and ends with the two following functions:
enum ctx_state prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
from include/linux/context_tracking.h. Context tracking is a Linux kernel subsystem which provides kernel boundary probes to keep track of the transitions between level contexts with two basic initial contexts: user or kernel. The exception_enter function checks that context tracking is enabled. If it is enabled, exception_enter reads the previous context and compares it with CONTEXT_KERNEL. If the previous context is user, we call the context_tracking_exit function from kernel/context_tracking.c, which informs the context tracking subsystem that a processor is exiting user mode and entering kernel mode:
if (!context_tracking_is_enabled())
    return 0;
prev_ctx = this_cpu_read(context_tracking.state);
if (prev_ctx != CONTEXT_KERNEL)
    context_tracking_exit(prev_ctx);
return prev_ctx;
If the previous context is not user, we just return it. The prev_ctx has the enum ctx_state type, which is defined in include/linux/context_tracking_state.h and looks like:
enum ctx_state {
    CONTEXT_KERNEL = 0,
    CONTEXT_USER,
    CONTEXT_GUEST,
} state;
The second function is exception_exit, defined in the same include/linux/context_tracking.h file. It checks that context tracking is enabled and calls the context_tracking_enter function if the previous context was user:
static inline void exception_exit(enum ctx_state prev_ctx)
{
    if (context_tracking_is_enabled()) {
        if (prev_ctx != CONTEXT_KERNEL)
            context_tracking_enter(prev_ctx);
    }
}
The context_tracking_enter function informs the context tracking subsystem that a processor is going to enter user mode from kernel mode. We can see the following code between exception_enter and exception_exit:
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
        NOTIFY_STOP) {
    conditional_sti(regs);
    do_trap(trapnr, signr, str, regs, error_code,
            fill_trap_info(regs, signr, trapnr, &info));
}
First of all it calls the notify_die function, which is defined in kernel/notifier.c. To get notified of a kernel panic, kernel oops, Non-Maskable Interrupt or other events, a caller needs to insert itself in the notify_die chain; the notify_die function then calls everyone registered in this chain. The Linux kernel has a special mechanism that allows kernel code to ask to be told when something happens, and this mechanism is called notifiers or notifier chains. This mechanism is used, for example, for USB hotplug events (look at drivers/usb/core/notify.c), for memory hotplug (look at include/linux/memory.h, the hotplug_memory_notifier macro, etc.), system reboots, etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special notifier_block structure and passes it to the notifier_chain_register function. An event can be sent with a call to the notifier_call_chain function. First of all the notify_die function fills the die_args structure with the trap number, trap string, registers and other values:
struct die_args args = {
    .regs   = regs,
    .str    = str,
    .err    = err,
    .trapnr = trap,
    .signr  = sig,
};
and returns the result of the atomic_notifier_call_chain function with the die_chain:
static ATOMIC_NOTIFIER_HEAD(die_chain);
return atomic_notifier_call_chain(&die_chain, val, &args);
which just expands to the atomic_notifier_head structure that contains a lock and a notifier_block:
struct atomic_notifier_head {
    spinlock_t lock;
    struct notifier_block __rcu *head;
};
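The notifier chain idea described above (a singly-linked list of callbacks that is walked until somebody asks to stop) can be sketched in user space as follows. This is a simplified illustration and not the kernel code: all sketch_* names and the two example callbacks are hypothetical, and there is no locking or RCU here, which the kernel's atomic variant needs. The NOTIFY_* values are assumed to mirror the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes, assumed to mirror the kernel's NOTIFY_* values. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

struct sketch_notifier_block {
    int (*notifier_call)(struct sketch_notifier_block *nb,
                         unsigned long val, void *data);
    struct sketch_notifier_block *next;
    int priority;               /* higher priority is called first */
};

/* Insert n into the list, keeping it sorted by descending priority. */
void sketch_chain_register(struct sketch_notifier_block **head,
                           struct sketch_notifier_block *n)
{
    assert(head != NULL && n != NULL);
    while (*head != NULL && (*head)->priority >= n->priority)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

/* Call every block in turn; stop early if one sets NOTIFY_STOP_MASK. */
int sketch_call_chain(struct sketch_notifier_block *head,
                      unsigned long val, void *data)
{
    struct sketch_notifier_block *nb;
    int ret = NOTIFY_DONE;

    for (nb = head; nb != NULL; nb = nb->next) {
        ret = nb->notifier_call(nb, val, data);
        if (ret & NOTIFY_STOP_MASK)
            break;
    }
    return ret;     /* value returned by the last callback that ran */
}

/* Example callback: count the call and let the chain continue. */
int counting_cb(struct sketch_notifier_block *nb, unsigned long val, void *data)
{
    (void)nb; (void)val;
    ++*(int *)data;
    return NOTIFY_OK;
}

/* Example callback: count the call and stop the chain. */
int stopping_cb(struct sketch_notifier_block *nb, unsigned long val, void *data)
{
    (void)nb; (void)val;
    ++*(int *)data;
    return NOTIFY_STOP;
}
```

If a high-priority block that returns NOTIFY_STOP is registered together with a low-priority one, only the first callback runs and the caller sees NOTIFY_STOP, which is the case do_error_trap tests for.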
The atomic_notifier_call_chain function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If notify_die in do_error_trap does not return NOTIFY_STOP, we execute the conditional_sti function from arch/x86/kernel/traps.c, which checks the value of the interrupt flag and enables interrupts depending on it:
static inline void conditional_sti(struct pt_regs *regs)
{
    if (regs->flags & X86_EFLAGS_IF)
        local_irq_enable();
}
You can read more about the local_irq_enable macro in the second part of this chapter. The next and last call in do_error_trap is the do_trap function. First of all the do_trap function defines the tsk variable, which has the task_struct type and represents the currently interrupted process. After the definition of tsk, we can see the call of the do_trap_no_signal function:
struct task_struct *tsk = current;
if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
    return;
The do_trap_no_signal function makes two checks:
Did we come from the Virtual 8086 mode;
Did we come from kernelspace.
if (v8086_mode(regs)) {
    ...
}
if (!user_mode(regs)) {
    ...
}
return -1;
We will not consider the first case because long mode does not support the Virtual 8086 mode. In the second case we invoke the fixup_exception function, which will try to recover from a fault, and die if we can't:
if (!fixup_exception(regs)) {
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = trapnr;
    die(str, regs, error_code);
}
The die function, defined in the arch/x86/kernel/dumpstack.c source code file, prints useful information about the stack, registers and kernel modules, and causes a kernel oops. If we came from userspace the do_trap_no_signal function will return -1 and the execution of the do_trap function will continue. If we passed through the do_trap_no_signal function and did not exit from do_trap after this, it means that the previous context was user. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode, etc. When an exception occurs the Linux kernel sends a signal to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the do_trap function we need to send a signal with the given number (SIGFPE for the divide error, SIGILL for the invalid opcode exception, etc.). First of all we save the error code and the vector number in the currently interrupted process by filling thread.error_code and thread.trap_nr:
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = trapnr;
After this we check whether we need to print information about unhandled signals for the interrupted process. We check that the show_unhandled_signals variable is set, that the unhandled_signal function from kernel/signal.c reports the signal as unhandled, and printk_ratelimit:
#ifdef CONFIG_X86_64
    if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
        printk_ratelimit()) {
        pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
                tsk->comm, tsk->pid, str,
                regs->ip, regs->sp, error_code);
        print_vma_addr(" in ", regs->ip);
        pr_cont("\n");
    }
#endif
And send the given signal to the interrupted process:
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
This is the end of do_trap. We just saw the generic implementation for eight different exceptions which are defined with the DO_ERROR macro. Now let's look at other exception handlers.
Double fault
The next exception is #DF or Double fault. This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
Note that this exception runs on the DOUBLEFAULT_STACK Interrupt Stack Table entry, which has index 1:
#define DOUBLEFAULT_STACK 1
The double_fault is the handler for this exception and is defined in arch/x86/kernel/traps.c. The double_fault handler starts with the definition of two variables, a string that describes the exception and the interrupted process, as other exception handlers do:
static const char str[] = "double fault";
struct task_struct *tsk = current;
The handler of the double fault exception is split into two parts. The first part is the check whether the fault is a non-IST fault on the espfix64 stack. Actually the iret instruction restores only the bottom 16 bits of the stack pointer when returning to a 16 bit segment. The espfix feature solves this problem. So if we got a non-IST fault on the espfix64 stack, we modify the stack to make it look like a General Protection Fault:
struct pt_regs *normal_regs = task_pt_regs(current);
memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
normal_regs->orig_ax = 0;
regs->ip = (unsigned long)general_protection;
regs->sp = (unsigned long)&normal_regs->orig_ax;
return;
In the second case we do almost the same as we did in the previous exception handlers. The first step is the call of the ist_enter function that discards the previous context, user in our case:
ist_enter(regs);
After this we fill the interrupted process with the vector number of the Double fault exception and the error code, as we did in the previous handlers:
tsk->thread.error_code = error_code;
tsk->thread.trap_nr = X86_TRAP_DF;
Next we print useful information about the double fault (PID number, registers content):
#ifdef CONFIG_DOUBLEFAULT
    df_debug(regs, error_code);
#endif
And die:
for (;;)
    die(str, regs, error_code);
That's all.
Device not available exception handler
The next exception is #NM or Device not available. The Device not available exception can occur depending on these things:
The processor executed an x87 FPU floating-point instruction while the EM flag in control register cr0 was set;
The processor executed a wait or fwait instruction while the MP and TS flags of register cr0 were set;
The processor executed an x87 FPU, MMX or SSE instruction while the TS flag in control register cr0 was set and the EM flag was clear.
The handler of the Device not available exception is the do_device_not_available function, and it is defined in the arch/x86/kernel/traps.c source code file too. It starts and ends with getting the previous context, like the other traps which we saw in the beginning of this part:
enum ctx_state prev_state;
prev_state = exception_enter();
...
...
...
exception_exit(prev_state);
In the next step we check that the FPU is not eager:
BUG_ON(use_eager_fpu());
When we switch into a task or interrupt we may avoid loading the FPU state. If a task uses it, we catch the Device not Available exception. If we load the FPU state during task switching, the FPU is eager. In the next step we check the EM flag of the cr0 control register, which shows us whether the x87 floating point unit is present (flag clear) or not (flag set):
#ifdef CONFIG_MATH_EMULATION
    if (read_cr0() & X86_CR0_EM) {
        struct math_emu_info info = { };
        conditional_sti(regs);
        info.regs = regs;
        math_emulate(&info);
        exception_exit(prev_state);
        return;
    }
#endif
If the x87 floating point unit is not present, we enable interrupts with conditional_sti, fill the math_emu_info structure (defined in arch/x86/include/asm/math_emu.h) with the registers of the interrupted task and call the math_emulate function from arch/x86/math-emu/fpu_entry.c. As you can understand from the function's name, it emulates the x87 FPU unit (we will learn more about the x87 in a special chapter). Otherwise, if the X86_CR0_EM flag is clear, which means that the x87 FPU unit is present, we call the fpu__restore function from arch/x86/kernel/fpu/core.c, which copies the FPU registers from the fpustate to the live hardware registers. After this, FPU instructions can be used:
fpu__restore(&current->thread.fpu);
General protection fault exception handler
The next exception is #GP or General protection fault. This exception occurs when the processor detects one of a class of protection violations called general-protection violations. It can be:
Exceeding the segment limit when accessing the cs, ds, es, fs or gs segments;
Loading the ss, ds, es, fs or gs register with a segment selector for a system segment;
Violating any of the privilege rules;
and others...
The exception handler for this exception is do_general_protection from arch/x86/kernel/traps.c. The do_general_protection function starts and ends, like other exception handlers, with getting the previous context:
prev_state = exception_enter();
...
exception_exit(prev_state);
After this we re-enable interrupts if they were enabled in the interrupted context and check whether we came from the Virtual 8086 mode:
conditional_sti(regs);
if (v8086_mode(regs)) {
    local_irq_enable();
    handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
    goto exit;
}
As long mode does not support this mode, we will not consider exception handling for this case. In the next step we check whether the previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception, we fill the interrupted process with the vector number and error code of the exception and pass it to the notify_die chain:
if (!user_mode(regs)) {
    if (fixup_exception(regs))
        goto exit;
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = X86_TRAP_GP;
    if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
                   X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
        die("general protection fault", regs, error_code);
    goto exit;
}
If we can fix the exception we go to the exit label, which exits from the exception state:
exit:
    exception_exit(prev_state);
If we came from user mode we send the SIGSEGV signal to the interrupted process, as we did in the do_trap function:
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
    printk_ratelimit()) {
    pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
            tsk->comm, task_pid_nr(tsk),
            regs->ip, regs->sp, error_code);
    print_vma_addr(" in ", regs->ip);
    pr_cont("\n");
}
force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
That's all.
Conclusion
It is the end of the fifth part of the Interrupts and Interrupt Handling chapter, and we saw the implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for Non-Maskable Interrupts, the handling of the math coprocessor and SIMD coprocessor exceptions and many more.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
Interrupt descriptor Table
iret instruction
GCC macro Concatenation
kernel panic
kernel oops
Non-Maskable Interrupt
hotplug
interrupt flag
long mode
signal
printk
coprocessor
SIMD
Interrupt Stack Table
PID
x87 FPU
control register
MMX
Previous part
Interrupts and Interrupt Handling. Part 6.
Non-maskable interrupt handler
It is the sixth part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we saw the implementation of some exception handlers: for the General Protection Fault exception, the divide exception, the invalid opcode exception, etc. As I wrote in the previous part, we will see the implementations of the rest of the exceptions in this part. We will see the implementation of the following handlers in this part:
Non-Maskable interrupt;
BOUND Range Exceeded Exception;
Coprocessor exception;
SIMD coprocessor exception.
So, let's start.
Non-Maskable interrupt handling
A Non-Maskable interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In general, a non-maskable interrupt can be generated in either of two ways:
External hardware asserts the non-maskable interrupt pin on the CPU.
The processor receives a message on the system bus or the APIC serial bus with the delivery mode NMI.
When the processor receives an NMI from one of these sources, it handles it immediately by calling the NMI handler pointed to by the interrupt vector which has number 2 (see the table in the first part). We already filled the Interrupt Descriptor Table with the vector number, the address of the nmi interrupt handler and the NMI_STACK Interrupt Stack Table entry:
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
in the trap_init function, which is defined in the arch/x86/kernel/traps.c source code file. In the previous parts we saw that the entry points of all interrupt handlers are defined with the:
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
    ...
    ...
    ...
END(\sym)
.endm
macro from the arch/x86/entry/entry_64.S assembly source code file. But the handler of the Non-Maskable interrupts is not defined with this macro. It has its own entry point:
ENTRY(nmi)
    ...
    ...
    ...
END(nmi)
in the same arch/x86/entry/entry_64.S assembly file. Let's dive into it and try to understand how the Non-Maskable interrupt handler works. The nmi handler starts with the call of the:
PARAVIRT_ADJUST_EXCEPTION_FRAME
macro, but we will not dive into details about it in this part, because this macro is related to the Paravirtualization stuff which we will see in another chapter. After this we save the content of the rdx register on the stack:
pushq %rdx
And check that cs was not the kernel segment when the non-maskable interrupt occurred:
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
The __KERNEL_CS macro is defined in arch/x86/include/asm/segment.h and represents the second descriptor in the Global Descriptor Table:
#define GDT_ENTRY_KERNEL_CS 2
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
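The multiplication by 8 in __KERNEL_CS follows from the layout of an x86 segment selector: bits 0-1 hold the requested privilege level, bit 2 is the table indicator and the descriptor index starts at bit 3. A minimal sketch of this encoding follows; make_selector, selector_index, selector_rpl and SKETCH_KERNEL_CS are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GDT_ENTRY_KERNEL_CS 2
#define SKETCH_KERNEL_CS    (GDT_ENTRY_KERNEL_CS * 8)   /* mirrors __KERNEL_CS */

/*
 * Build a selector from its parts:
 *   bits 15..3  descriptor index
 *   bit  2      table indicator (0 = GDT, 1 = LDT)
 *   bits 1..0   requested privilege level
 */
uint16_t make_selector(unsigned int index, unsigned int ti, unsigned int rpl)
{
    assert(index < 8192 && ti < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Extract the descriptor index from a selector. */
unsigned int selector_index(uint16_t sel)
{
    return sel >> 3;
}

/* Extract the requested privilege level from a selector. */
unsigned int selector_rpl(uint16_t sel)
{
    return sel & 3u;
}
```

A selector for GDT entry 2 with privilege level 0 is 0x10, the same value as GDT_ENTRY_KERNEL_CS*8. Any selector with a non-zero privilege level, such as 0x33, necessarily compares unequal to it.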
You can read more about the GDT in the second part of the Linux kernel booting process chapter. If cs is not the kernel segment, it means that this is not a nested NMI and we jump to the first_nmi label. Let's consider this case. First of all, in the first_nmi label, we reload rdx with the value at the top of the stack (the rdx value which we saved above) and push 1 onto the stack:
first_nmi:
    movq (%rsp), %rdx
    pushq $1
Why do we push 1 on the stack? As the comment says: We allow breakpoints in NMIs. On x86_64, like on other architectures, the CPU will not execute another NMI until the first NMI is complete. An NMI finishes with the iret instruction, like other interrupts and exceptions do. If the NMI handler triggers a page fault, a breakpoint or another exception, those exceptions use the iret instruction too. If this happens while in NMI context, the CPU will leave NMI context and a new NMI may come in. The iret used to return from those exceptions will re-enable NMIs and we will get nested non-maskable interrupts. The problem is that the NMI handler will not return to the state it was in when the exception triggered; instead it will return to a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack. We can have nested NMIs where the next NMI is using the top of the stack of the previous NMI. It means that we cannot simply execute it, because a nested non-maskable interrupt would corrupt the stack of the previous non-maskable interrupt. That's why we have allocated space on the stack for a temporary variable. We check this variable: it is set while a previous NMI is executing and cleared if there is no nested NMI. We push 1 here to the previously allocated space on the stack to denote that a non-maskable interrupt is currently executing. Remember that when an NMI or another exception occurs we have the following stack frame:
+------------------------+
|SS|
|RSP|
|RFLAGS|
|CS|
|RIP|
+------------------------+
and also an error code if the exception has one. So, after all of these manipulations our stack frame will look like this:
+------------------------+
|SS|
|RSP|
|RFLAGS|
|CS|
|RIP|
|RDX|
|1|
+------------------------+
In the next step we allocate yet another 40 bytes on the stack:
subq $(5*8), %rsp
and push a copy of the original stack frame after the allocated space:
.rept 5
pushq 11*8(%rsp)
.endr
with the .rept assembly directive. We need a copy of the original stack frame. More precisely, we need two copies of the interrupt stack frame: the saved stack frame and the copied stack frame. Here we push the original stack frame to the saved stack frame, which is located after the just allocated 40 bytes (the copied stack frame). The saved stack frame is used to fix up the copied stack frame, which a nested NMI may change. The second one, the copied stack frame, is modified by any nested NMI to let the first NMI know that we triggered a second NMI and that we should repeat the first NMI handler. Ok, we have made the first copy of the original stack frame, now it is time to make the second copy:
addq $(10*8), %rsp
.rept 5
pushq -6*8(%rsp)
.endr
subq $(5*8), %rsp
After all of these manipulations our stack frame will look like this:
+-------------------------+
| original SS             |
| original Return RSP     |
| original RFLAGS         |
| original CS             |
| original RIP            |
+-------------------------+
| temp storage for rdx    |
+-------------------------+
| NMI executing variable  |
+-------------------------+
| copied SS               |
| copied Return RSP       |
| copied RFLAGS           |
| copied CS               |
| copied RIP              |
+-------------------------+
| Saved SS                |
| Saved Return RSP        |
| Saved RFLAGS            |
| Saved CS                |
| Saved RIP               |
+-------------------------+
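The stack manipulations above are easy to lose track of, so here is a small user-space simulation that replays them on an array of 64-bit words. It is only a sketch: struct sketch_stack, push, push_rsp_rel and build_nmi_frames are hypothetical names, and the "stack pointer" is an array index that decreases on every push. The one hardware rule it relies on is that a pushq with an rsp-relative memory operand computes the operand address before rsp is decremented.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 32

struct sketch_stack {
    uint64_t mem[STACK_WORDS];
    int sp;     /* index of the word at the top; pushes decrease it */
};

/* pushq value */
void push(struct sketch_stack *s, uint64_t v)
{
    assert(s->sp > 0);
    s->mem[--s->sp] = v;
}

/* pushq off*8(%rsp): the source address uses rsp before the decrement. */
void push_rsp_rel(struct sketch_stack *s, int off_words)
{
    int idx = s->sp + off_words;

    assert(idx >= 0 && idx < STACK_WORDS);
    push(s, s->mem[idx]);
}

/* hw_frame holds SS, RSP, RFLAGS, CS, RIP in the order the CPU pushes them. */
void build_nmi_frames(struct sketch_stack *s, const uint64_t hw_frame[5],
                      uint64_t rdx)
{
    int i;

    s->sp = STACK_WORDS;
    for (i = 0; i < 5; i++)         /* frame pushed by the processor     */
        push(s, hw_frame[i]);
    push(s, rdx);                   /* pushq %rdx                        */
    push(s, 1);                     /* pushq $1: NMI executing variable  */
    s->sp -= 5;                     /* subq $(5*8), %rsp                 */
    for (i = 0; i < 5; i++)         /* .rept 5: fill the saved frame     */
        push_rsp_rel(s, 11);
    s->sp += 10;                    /* addq $(10*8), %rsp                */
    for (i = 0; i < 5; i++)         /* .rept 5: fill the copied frame    */
        push_rsp_rel(s, -6);
    s->sp -= 5;                     /* subq $(5*8), %rsp                 */
}
```

After build_nmi_frames() the array contains, from high to low index, the original frame, the saved rdx, the value 1, the copied frame and the saved frame, with sp pointing at the saved RIP.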
After this we push a dummy error code on the stack, as we already did in the previous exception handlers, and allocate space for the general purpose registers on the stack:
pushq $-1
ALLOC_PT_GPREGS_ON_STACK
We already saw the implementation of the ALLOC_PT_GPREGS_ON_STACK macro in the third part of the interrupts chapter. This macro is defined in arch/x86/entry/calling.h and allocates yet another 120 bytes on the stack for the general purpose registers, from rdi to r15:
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
addq $-(15*8+\addskip), %rsp
.endm
After the space allocation for the general purpose registers we can see the call of paranoid_entry:
call paranoid_entry
We may remember this label from the previous parts. It pushes the general purpose registers on the stack, reads the MSR_GS_BASE Model Specific Register and checks its value. If the value of MSR_GS_BASE is negative, we came from kernel mode and just return from paranoid_entry; otherwise it means that we came from user mode and need to execute the swapgs instruction, which exchanges the user gs with the kernel gs:
ENTRY(paranoid_entry)
    cld
    SAVE_C_REGS 8
    SAVE_EXTRA_REGS 8
    movl $1, %ebx
    movl $MSR_GS_BASE, %ecx
    rdmsr
    testl %edx, %edx
    js 1f
    SWAPGS
    xorl %ebx, %ebx
1:  ret
END(paranoid_entry)
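The "negative MSR_GS_BASE" test above works because rdmsr returns the upper 32 bits of the MSR in edx, so the sign flag set by testl %edx, %edx is bit 63 of the gs base. The decision can be sketched in user space like this; gs_base_is_kernel and paranoid_entry_ebx are hypothetical helper names, and the sketch only models the value left in ebx, not the swapgs itself.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if bit 63 of the gs base is set (what "js" tests via edx). */
int gs_base_is_kernel(uint64_t gs_base)
{
    uint32_t edx = (uint32_t)(gs_base >> 32);   /* upper half, as from rdmsr */

    return (edx & 0x80000000u) != 0;
}

/*
 * Model of the value left in ebx by paranoid_entry:
 *   1 - gs base was already a kernel one, no swapgs executed
 *   0 - gs base was a user one, swapgs executed
 */
int paranoid_entry_ebx(uint64_t gs_base)
{
    return gs_base_is_kernel(gs_base) ? 1 : 0;
}
```

Kernel addresses on x86_64 live in the upper half of the address space and therefore have bit 63 set, while user addresses do not, which is what makes this one-bit test sufficient.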
Note that after the swapgs instruction we zero the ebx register. Later we will check the content of this register: if we executed swapgs then ebx must contain 0, and 1 otherwise. In the next step we store the value of the cr2 control register in the r12 register, because the NMI handler can cause a page fault and corrupt the value of this control register:
movq %cr2, %r12
Now it is time to call the actual NMI handler. We put the address of pt_regs into rdi, the error code into rsi and call the do_nmi handler:
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
We will come back to do_nmi a little later in this part, but now let's look at what occurs after do_nmi finishes its execution. After the do_nmi handler has finished we check the cr2 register, because we can get a page fault while do_nmi is running; if we got one we restore the original cr2, otherwise we jump to the label 1. After this we test the content of the ebx register (remember that it must contain 0 if we have used the swapgs instruction and 1 if we didn't use it) and execute SWAPGS_UNSAFE_STACK if it contains 0, or jump to the nmi_restore label otherwise. The SWAPGS_UNSAFE_STACK macro just expands to the swapgs instruction. In the nmi_restore label we restore the general purpose registers, release the space allocated on the stack for these registers, clear our temporary variable and exit from the interrupt handler with the INTERRUPT_RETURN macro:
    movq %cr2, %rcx
    cmpq %rcx, %r12
    je 1f
    movq %r12, %cr2
1:
    testl %ebx, %ebx
    jnz nmi_restore
nmi_swapgs:
    SWAPGS_UNSAFE_STACK
nmi_restore:
    RESTORE_EXTRA_REGS
    RESTORE_C_REGS
    /* Pop the extra iret frame at once */
    REMOVE_PT_GPREGS_FROM_STACK 6*8
    /* Clear the NMI executing stack variable */
    movq $0, 5*8(%rsp)
    INTERRUPT_RETURN
whereINTERRUPT_RETURNisdefinedinthearch/x86/include/irqflags.handjustexpandstotheiretinstruction.That'sall.
Now let's consider the case when another NMI occurs while the previous NMI has not finished its execution. You may remember from the beginning of this part that we check whether we came from userspace and jump to first_nmi in that case:
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
Note that in this case it is always the first NMI, because if the first NMI hits a page fault, breakpoint or another exception, that exception is handled in kernel mode. If we didn't come from userspace, first of all we test our temporary variable:
cmpl $1, -8(%rsp)
je nested_nmi
and if it is set to 1 we jump to the nested_nmi label. If it is not 1, we test the IST stack. In the case of nested NMIs we check whether we are above repeat_nmi. In this case we ignore it; otherwise we check whether we are above end_repeat_nmi and jump to the nested_nmi_out label.
Now let's look at the do_nmi exception handler. This function is defined in the arch/x86/kernel/nmi.c source code file and, like all exception handlers, takes two parameters:
address of the pt_regs;
error code.
do_nmi starts with a call of the nmi_nesting_preprocess function and ends with a call of nmi_nesting_postprocess. The nmi_nesting_preprocess function checks whether we are running on the debug stack (which is unlikely); if we are, it calls the debug_stack_set_zero function from arch/x86/kernel/cpu/common.c and sets the update_debug_stack per-cpu variable to 1. The debug_stack_set_zero function increases the debug_stack_use_ctr per-cpu variable and loads a new Interrupt Descriptor Table:
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
    if (unlikely(is_debug_stack(regs->sp))) {
        debug_stack_set_zero();
        this_cpu_write(update_debug_stack, 1);
    }
}
The nmi_nesting_postprocess function checks the update_debug_stack per-cpu variable which we set in nmi_nesting_preprocess and resets the debug stack, or in other words it loads the original Interrupt Descriptor Table. After the call of the nmi_nesting_preprocess function, we can see the call of nmi_enter in do_nmi. nmi_enter increases the lockdep_recursion field of the interrupted process, updates the preempt counter and informs the RCU subsystem about the NMI. There is also the nmi_exit function, which does the same things as nmi_enter in reverse. After nmi_enter we increase __nmi_count in the irq_stat structure and call the default_do_nmi function. First of all, default_do_nmi compares the address of the previous NMI with the current one and then updates the address of the last NMI:
if (regs->ip == __this_cpu_read(last_nmi_rip))
    b2b = true;
else
    __this_cpu_write(swallow_nmi, false);
__this_cpu_write(last_nmi_rip, regs->ip);
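The back-to-back check can be tried outside the kernel with a small model. This is a sketch under assumptions: struct nmi_cpu_state and nmi_note_rip are hypothetical names, and an ordinary struct stands in for the per-cpu variables.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; the kernel keeps these in per-cpu variables. */
struct nmi_cpu_state {
    unsigned long last_nmi_rip;   /* instruction pointer seen by the previous NMI */
    bool swallow_nmi;             /* cleared whenever the NMI is not back-to-back */
};

/*
 * Returns true when this NMI interrupted the same instruction as the
 * previous one, which is the condition the snippet above calls b2b.
 */
static bool nmi_note_rip(struct nmi_cpu_state *st, unsigned long ip)
{
    bool b2b = false;

    assert(st);
    if (ip == st->last_nmi_rip)
        b2b = true;
    else
        st->swallow_nmi = false;
    st->last_nmi_rip = ip;
    return b2b;
}
```

Two calls with the same instruction pointer report a back-to-back NMI on the second call; a call with a different address resets the state.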
After this, first of all we need to handle CPU-specific NMIs:
handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
And then the non-specific NMIs, depending on their reason:
reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
    if (reason & NMI_REASON_SERR)
        pci_serr_error(reason, regs);
    else if (reason & NMI_REASON_IOCHK)
        io_check_error(reason, regs);
    __this_cpu_add(nmi_stats.external, 1);
    return;
}
That's all.
Range Exceeded Exception
The next exception is the BOUND range exceeded exception. The BOUND instruction determines whether the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a BOUND range exceeded exception, or #BR, occurs. The handler of the #BR exception is the do_bounds function, which is defined in arch/x86/kernel/traps.c. The do_bounds handler starts with a call of the exception_enter function and ends with a call of exception_exit:
prev_state = exception_enter();
if (notify_die(DIE_TRAP, "bounds", regs, error_code,
               X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
    goto exit;
...
...
...
exception_exit(prev_state);
return;
After we have got the state of the previous context, we add the exception to the notify_die chain and if it returns NOTIFY_STOP we return from the exception. You can read more about notify chains and the context tracking functions in the previous part. In the next step we enable interrupts if they were disabled, with the conditional_sti function, which checks the IF flag and calls local_irq_enable depending on its value:
conditional_sti(regs);
if (!user_mode(regs))
    die("bounds", regs, error_code);
and check whether we came from user mode; if we did not, we send the SIGSEGV signal with the die function. After this we check whether MPX is enabled, and if this feature is disabled we jump to the exit_trap label:
if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
    goto exit_trap;
}
where we execute the do_trap function (you can find more about it in the previous part):
exit_trap:
    do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
    exception_exit(prev_state);
If the MPX feature is enabled we check the BNDSTATUS with the get_xsave_field_ptr function, and if it is zero, it means that MPX was not responsible for this exception:
bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
    goto exit_trap;
After all of this, only one case remains: MPX is responsible for this exception. We will not dive into the details of the Intel Memory Protection Extensions in this part, but will look at them in another chapter.
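To make the check performed by the BOUND instruction concrete, here is a small C sketch of it. The names struct bounds32 and bound_check are made up for this illustration; the processor does the comparison in hardware and raises #BR instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The bounds operand of BOUND: a pair of signed lower and upper limits. */
struct bounds32 {
    int32_t lower;
    int32_t upper;
};

/*
 * Returns true when index passes the check and false in the cases
 * where the CPU would raise #BR. Both limits are inclusive.
 */
static bool bound_check(int32_t index, const struct bounds32 *b)
{
    assert(b);
    return index >= b->lower && index <= b->upper;
}
```

For an array of ten elements the bounds operand would hold 0 and 9, so indexes 0 and 9 pass while -1 and 10 lead to the exception.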
Coprocessor exception and SIMD exception
The next two exceptions are the x87 FPU Floating-Point Error exception, or #MF, and the SIMD Floating-Point Exception, or #XF. The first exception occurs when the x87 FPU has detected a floating point error, for example division by zero, numeric overflow, etc. The second exception occurs when the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. Its causes can be the same as for the x87 FPU. The handlers for these exceptions, do_coprocessor_error and do_simd_coprocessor_error, are defined in arch/x86/kernel/traps.c and are very similar to each other. They both call the math_error function from the same source code file but pass different vector numbers. do_coprocessor_error passes the X86_TRAP_MF vector number to math_error:
dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
    enum ctx_state prev_state;
    prev_state = exception_enter();
    math_error(regs, error_code, X86_TRAP_MF);
    exception_exit(prev_state);
}
and do_simd_coprocessor_error passes X86_TRAP_XF to the math_error function:
dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
    enum ctx_state prev_state;
    prev_state = exception_enter();
    math_error(regs, error_code, X86_TRAP_XF);
    exception_exit(prev_state);
}
First of all the math_error function defines the current interrupted task, the address of its fpu and a string which describes the exception, adds the exception to the notify_die chain and returns from the exception handler if it returns NOTIFY_STOP:
struct task_struct *task = current;
struct fpu *fpu = &task->thread.fpu;
siginfo_t info;
char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
                                      "simd exception";
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
    return;
After this we check whether we came from kernel mode, and if so we try to fix the exception with the fixup_exception function. If we cannot, we fill the task with the exception's error code and vector number and die:
if (!user_mode(regs)) {
    if (!fixup_exception(regs)) {
        task->thread.error_code = error_code;
        task->thread.trap_nr = trapnr;
        die(str, regs, error_code);
    }
    return;
}
If we came from user mode, we save the fpu state, fill the task structure with the vector number of the exception and the error code, and fill siginfo_t with the signal number, errno, the address where the exception occurred and the signal code:
fpu__save(fpu);
task->thread.trap_nr = trapnr;
task->thread.error_code = error_code;
info.si_signo = SIGFPE;
info.si_errno = 0;
info.si_addr = (void __user *)uprobe_get_trap_addr(regs);
info.si_code = fpu__exception_code(fpu, trapnr);
After this we check the signal code and if it is zero we return:
if (!info.si_code)
    return;
Otherwise we send the SIGFPE signal at the end:
force_sig_info(SIGFPE, &info, task);
That's all.
Conclusion
This is the end of the sixth part of the Interrupts and Interrupt Handling chapter. In this part we saw the implementation of some exception handlers, such as the non-maskable interrupt, SIMD and x87 FPU floating point exceptions. We have finally finished with the trap_init function and will move on in the next part. Our next topic is external interrupts and the early_irq_init function called from init/main.c.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
General Protection Fault
opcode
Non-Maskable
BOUND instruction
CPU socket
Interrupt Descriptor Table
Interrupt Stack Table
Paravirtualization
.rept
SIMD
Coprocessor
x86_64
iret
page fault
breakpoint
Global Descriptor Table
stack frame
Model Specific register
percpu
RCU
MPX
x87 FPU
Previous part
Interrupts and Interrupt Handling. Part 7.
Introduction to external interrupts
This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you may remember, in the previous part we finished with the trap_init function from arch/x86/kernel/traps.c, and the next step is the call of the early_irq_init function from init/main.c.
Interrupts are signals that are sent across an IRQ, or Interrupt Request Line, by hardware or software. External hardware interrupts allow devices like a keyboard, mouse, etc., to indicate that they need the attention of the processor. Once the processor receives the Interrupt Request, it temporarily stops execution of the running program and invokes a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler (from this part on we will also call it ISR, or Interrupt Service Routine). The ISR can be found in the Interrupt Vector table, which is located at a fixed address in memory. After the interrupt is handled, the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine, and the appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a Unix signal to the interrupted process. That's why the kernel can handle an exception quickly. Unfortunately we cannot use this approach for external hardware interrupts, because they often arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of the interrupt:
I/O interrupts;
Timer interrupts;
Interprocessor interrupts.
I will try to describe all types of interrupts in this book.
Generally, a handler of an I/O interrupt must be flexible enough to service several devices at the same time. For example, in the PCI bus architecture several devices may share the same IRQ line. In the simplest case the Linux kernel must do the following things when an I/O interrupt occurs:
Save the value of the IRQ and the registers' contents on the kernel stack;
Send an acknowledgment to the hardware controller which is servicing the IRQ line;
Execute the interrupt service routine (ISR) which is associated with the device;
Restore registers and return from the interrupt;
OK, we know a little theory, so now let's start with the early_irq_init function. The implementation of the early_irq_init function is in kernel/irq/irqdesc.c. This function performs early initialization of the irq_desc structure. The irq_desc structure is the foundation of the interrupt management code in the Linux kernel. An array of these structures, which has the same name, irq_desc, keeps track of every interrupt request source in the Linux kernel. This structure is defined in include/linux/irqdesc.h and, as you can see, it depends on the CONFIG_SPARSE_IRQ kernel configuration option. This kernel configuration option enables support for sparse irqs. The irq_desc structure contains many different fields:
irq_common_data - per irq and chip data passed down to chip functions;
status_use_accessors - contains the status of the interrupt source, which can be a combination of the values from the enum in include/linux/irq.h and different macros which are defined in the same source code file;
kstat_irqs - irq stats per-cpu;
handle_irq - high-level irq-events handler;
action - identifies the interrupt service routines to be invoked when the IRQ occurs;
irq_count - counter of interrupt occurrences on the IRQ line;
depth - 0 if the IRQ line is enabled and a positive value if it has been disabled at least once;
last_unhandled - aging timer for unhandled count;
irqs_unhandled - count of the unhandled interrupts;
lock - a spin lock used to serialize the accesses to the IRQ descriptor;
pending_mask - pending rebalanced interrupts;
owner - the owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a refcount on the module which provides the interrupts;
etc.
Of course these are not all the fields of the irq_desc structure, because it would take too long to describe each of them, but we will see them all soon.
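The idea of several devices sharing one IRQ line, together with the action, depth, irq_count and irqs_unhandled fields listed above, can be sketched in user-space C. Everything prefixed with model_ or MODEL_ is hypothetical and only borrows the names of the kernel fields; the real irq_desc and its handling code are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Return values in the spirit of the kernel's irqreturn_t. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* One registered ISR; several can hang off a shared IRQ line. */
struct model_irqaction {
    enum model_irqreturn (*handler)(int irq, void *dev_id);
    void *dev_id;
    struct model_irqaction *next;
};

/* A tiny subset of the irq_desc fields. */
struct model_irq_desc {
    struct model_irqaction *action;   /* chain of ISRs for this line */
    unsigned int depth;               /* 0 = enabled, >0 = disabled */
    unsigned int irq_count;           /* interrupts seen on the line */
    unsigned int irqs_unhandled;      /* interrupts nobody claimed */
};

static void model_disable_irq(struct model_irq_desc *d) { d->depth++; }

static void model_enable_irq(struct model_irq_desc *d)
{
    assert(d->depth > 0);   /* an unbalanced enable would be a bug */
    d->depth--;
}

/* Walk the chain; every device sharing the line gets asked. */
static enum model_irqreturn model_handle_irq(int irq, struct model_irq_desc *d)
{
    enum model_irqreturn ret = MODEL_IRQ_NONE;
    struct model_irqaction *a;

    if (d->depth > 0)
        return MODEL_IRQ_NONE;   /* the line is disabled */

    d->irq_count++;
    for (a = d->action; a != NULL; a = a->next)
        if (a->handler(irq, a->dev_id) == MODEL_IRQ_HANDLED)
            ret = MODEL_IRQ_HANDLED;
    if (ret == MODEL_IRQ_NONE)
        d->irqs_unhandled++;
    return ret;
}

/* Example ISR: claims the interrupt only if its device flag is set. */
static enum model_irqreturn model_dev_isr(int irq, void *dev_id)
{
    int *pending = dev_id;

    (void)irq;
    if (!*pending)
        return MODEL_IRQ_NONE;
    *pending = 0;
    return MODEL_IRQ_HANDLED;
}
```

Each handler in the chain decides for itself whether its device raised the interrupt, which is what makes sharing a line possible. The depth counter nests: the line is only live again after every disable has been matched by an enable.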
Early external interrupts initialization
Now let's look at the implementation of the early_irq_init function. Note that it depends on the CONFIG_SPARSE_IRQ kernel configuration option. For now we consider the implementation which is used when CONFIG_SPARSE_IRQ is not set. The function starts with the declaration of the following variables: irq descriptors counter, loop counter, memory node and the irq_desc descriptor:
int __init early_irq_init(void)
{
    int count, i, node = first_online_node;
    struct irq_desc *desc;
    ...
    ...
    ...
}
The node is an online NUMA node, which depends on the MAX_NUMNODES value, which in turn depends on the CONFIG_NODES_SHIFT kernel configuration parameter:
#define MAX_NUMNODES (1 << NODES_SHIFT)
...
...
...
#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT 0
#endif
As I already wrote, the implementation of the first_online_node macro depends on the MAX_NUMNODES value:
#if MAX_NUMNODES > 1
#define first_online_node first_node(node_states[N_ONLINE])
#else
#define first_online_node 0
#endif
node_states is an enum defined in include/linux/nodemask.h that represents the set of states of a node. In our case we are looking for an online node, and it will be 0 if MAX_NUMNODES is one. If MAX_NUMNODES is greater than one, node_states[N_ONLINE] is the mask of online nodes and the first_node macro expands to a call of the __first_node function, which returns the minimal, or first, online node:
#define first_node(src) __first_node(&(src))
static inline int __first_node(const nodemask_t *srcp)
{
    return min_t(int, MAX_NUMNODES, find_first_bit(srcp->bits, MAX_NUMNODES));
}
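What __first_node computes can be shown with a plain integer used as the node mask. This is a simplified sketch, not the kernel implementation: MODEL_MAX_NUMNODES, model_find_first_bit and model_first_node are hypothetical, and the mask is assumed to fit into a single unsigned long long.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_MAX_NUMNODES 64   /* assumption: the mask fits in one unsigned long long */

/* Index of the lowest set bit, or size when no bit is set. */
static int model_find_first_bit(unsigned long long bits, int size)
{
    int i;

    assert(size > 0 && size <= (int)(sizeof(bits) * CHAR_BIT));
    for (i = 0; i < size; i++)
        if (bits & (1ULL << i))
            return i;
    return size;
}

/* Lowest-numbered node in the mask; MODEL_MAX_NUMNODES if the mask is empty. */
static int model_first_node(unsigned long long online_mask)
{
    int bit = model_find_first_bit(online_mask, MODEL_MAX_NUMNODES);

    return bit < MODEL_MAX_NUMNODES ? bit : MODEL_MAX_NUMNODES;
}
```

A mask of 0xC (nodes 2 and 3 online) gives node 2, and an empty mask gives MODEL_MAX_NUMNODES, the same "no node" convention the min_t call produces.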
More about this will be in another chapter about NUMA. The next step after the declaration of these local variables is the call of the:
init_irq_default_affinity();
function. The init_irq_default_affinity function is defined in the same source code file and, depending on the CONFIG_SMP kernel configuration option, allocates a given cpumask structure (in our case it is irq_default_affinity):
#if defined(CONFIG_SMP)
cpumask_var_t irq_default_affinity;
static void __init init_irq_default_affinity(void)
{
    alloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
    cpumask_setall(irq_default_affinity);
}
#else
static void __init init_irq_default_affinity(void)
{
}
#endif
We know that when a piece of hardware, such as a disk controller or keyboard, needs attention from the processor, it raises an interrupt. The interrupt tells the processor that something has happened and that the processor should interrupt the current process and handle the incoming event. In order to prevent multiple devices from sending the same interrupts, the IRQ system was established, where each device in a computer system is assigned its own special IRQ so that its interrupts are unique. The Linux kernel can assign certain IRQs to specific processors. This is known as SMP IRQ affinity, and it allows you to control how your system will respond to various hardware events (that's why it has a real implementation only if the CONFIG_SMP kernel configuration option is set). After we have allocated the irq_default_affinity cpumask, we can see the printk output:
printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
which prints NR_IRQS:
~$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352
NR_IRQS is the maximum number of irq descriptors, or in other words the maximum number of interrupts. Its value depends on the state of the CONFIG_X86_IO_APIC kernel configuration option. If CONFIG_X86_IO_APIC is not set and the Linux kernel uses an old PIC chip, NR_IRQS is:
#define NR_IRQS_LEGACY 16
#ifdef CONFIG_X86_IO_APIC
...
...
...
#else
#define NR_IRQS NR_IRQS_LEGACY
#endif
Otherwise, when the CONFIG_X86_IO_APIC kernel configuration option is set, NR_IRQS depends on the number of processors and the number of interrupt vectors:
#define CPU_VECTOR_LIMIT (64 * NR_CPUS)
#define NR_VECTORS 256
#define IO_APIC_VECTOR_LIMIT (32 * MAX_IO_APICS)
#define MAX_IO_APICS 128
#define NR_IRQS \
    (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT ? \
        (NR_VECTORS + CPU_VECTOR_LIMIT) : \
        (NR_VECTORS + IO_APIC_VECTOR_LIMIT))
...
...
...
We remember from the previous parts that we can set the number of processors during the Linux kernel configuration process with the CONFIG_NR_CPUS configuration option.
In my case NR_CPUS is 8, so CPU_VECTOR_LIMIT is 512 and IO_APIC_VECTOR_LIMIT is 4096. CPU_VECTOR_LIMIT is therefore smaller than IO_APIC_VECTOR_LIMIT, the second branch of the macro is taken, and NR_IRQS is NR_VECTORS + IO_APIC_VECTOR_LIMIT = 256 + 4096 = 4352. The first branch (CPU_VECTOR_LIMIT > IO_APIC_VECTOR_LIMIT) is taken only when NR_CPUS is greater than 64. So NR_IRQS for my configuration is 4352:
~$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352
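The arithmetic of the NR_IRQS macro is easy to check with a few lines of C. The function below is a sketch that takes NR_CPUS as a run-time argument instead of a compile-time constant; model_nr_irqs and the MODEL_ constants are names invented for this example.

```c
#include <assert.h>

#define MODEL_NR_VECTORS   256
#define MODEL_MAX_IO_APICS 128

/* Mirrors the NR_IRQS macro, with NR_CPUS passed as a parameter. */
static int model_nr_irqs(int nr_cpus)
{
    int cpu_vector_limit = 64 * nr_cpus;
    int io_apic_vector_limit = 32 * MODEL_MAX_IO_APICS;

    assert(nr_cpus > 0);
    return cpu_vector_limit > io_apic_vector_limit
        ? MODEL_NR_VECTORS + cpu_vector_limit
        : MODEL_NR_VECTORS + io_apic_vector_limit;
}
```

For eight CPUs the I/O APIC limit wins and the result is 4352. The CPU limit only takes over above 64 CPUs, for example 128 CPUs give 256 + 8192 = 8448.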
In the next step we assign the array of IRQ descriptors to the desc variable, which we declared at the start of the early_irq_init function, and calculate the size of the irq_desc array with the ARRAY_SIZE macro:
desc = irq_desc;
count = ARRAY_SIZE(irq_desc);
The irq_desc array is defined in the same source code file and looks like:
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
    [0 ... NR_IRQS-1] = {
        .handle_irq = handle_bad_irq,
        .depth = 1,
        .lock = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
    }
};
irq_desc is an array of irq descriptors. It has three already initialized fields:
handle_irq - as I already wrote above, this field is the high-level irq-event handler. In our case it is initialized with the handle_bad_irq function, which is defined in the kernel/irq/handle.c source code file and handles spurious and unhandled irqs;
depth - 0 if the IRQ line is enabled and a positive value if it has been disabled at least once;
lock - a spin lock used to serialize the accesses to the IRQ descriptor.
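The [0 ... NR_IRQS-1] syntax in the definition of irq_desc is a GNU C range designator, an extension supported by gcc and clang, which gives every element of the array the same initial value. A minimal user-space sketch follows; the model_ names are hypothetical and the array is kept small on purpose.

```c
#include <assert.h>

#define MODEL_NR_IRQS 16   /* small on purpose; the real NR_IRQS is much larger */

struct model_desc {
    void (*handle_irq)(unsigned int irq);
    unsigned int depth;
};

static unsigned int model_bad_irq_hits;

/* Placeholder for handle_bad_irq: it only counts the calls. */
static void model_handle_bad_irq(unsigned int irq)
{
    assert(irq < MODEL_NR_IRQS);
    model_bad_irq_hits++;
}

/* GNU C range designator: every element gets the same initializer. */
static struct model_desc model_irq_desc[MODEL_NR_IRQS] = {
    [0 ... MODEL_NR_IRQS - 1] = {
        .handle_irq = model_handle_bad_irq,
        .depth      = 1,
    }
};

/* Same idea as the kernel's ARRAY_SIZE macro. */
#define MODEL_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
```

Since the initializer runs at compile time, every descriptor already points at the "bad irq" handler and is marked as disabled before any code executes.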
As we have calculated the count of interrupts and initialized our irq_desc array, we start to fill the descriptors in the loop:
for (i = 0; i < count; i++) {
    desc[i].kstat_irqs = alloc_percpu(unsigned int);
    alloc_masks(&desc[i], GFP_KERNEL, node);
    raw_spin_lock_init(&desc[i].lock);
    lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
    desc_set_defaults(i, &desc[i], node, NULL);
}
We go through all the interrupt descriptors and do the following things:
First of all we allocate a percpu variable for the irq kernel statistics with the alloc_percpu macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistics from userspace via /proc/stat:
~$ cat /proc/stat
cpu  207907 68 53904 5427850 14394 0 394 0 0 0
cpu0 25881 11 6684 679131 1351 0 18 0 0 0
cpu1 24791 16 5894 679994 2285 0 24 0 0 0
cpu2 26321 4 7154 678924 664 0 71 0 0 0
cpu3 26648 8 6931 678891 414 0 244 0 0 0
...
...
...
where the sixth column is the time spent servicing interrupts. After this we allocate a cpumask for the given irq descriptor affinity and initialize the spinlock for the given interrupt descriptor. From now on, before a critical section the lock is acquired with a call of raw_spin_lock and released with a call of raw_spin_unlock. In the next step we call the lockdep_set_class macro, which sets the lock validator class irq_desc_lock_class for the lock of the given interrupt descriptor. More about lockdep, spinlocks and other synchronization primitives will be described in a separate chapter.
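To pick the interrupt column out of a /proc/stat line from a program, something like the following sketch can be used. model_parse_irq_time is a hypothetical helper; it works on a line that has already been read into memory, so it does not touch the file itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extracts the sixth numeric field (time spent servicing interrupts)
 * from one "cpu..." line of /proc/stat. Returns 0 on success and -1 if
 * the line does not have the expected shape.
 */
static int model_parse_irq_time(const char *line, unsigned long long *irq)
{
    unsigned long long user, nice, system, idle, iowait, value;
    char name[16];

    assert(line && irq);
    if (sscanf(line, "%15s %llu %llu %llu %llu %llu %llu",
               name, &user, &nice, &system, &idle, &iowait, &value) != 7)
        return -1;
    if (strncmp(name, "cpu", 3) != 0)
        return -1;
    *irq = value;
    return 0;
}
```

A caller would read /proc/stat line by line, for example with fgets, and pass every line that starts with cpu to this helper.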
At the end of the loop we call the desc_set_defaults function from kernel/irq/irqdesc.c. This function takes four parameters:
number of the irq;
interrupt descriptor;
online NUMA node;
owner of the interrupt descriptor. Interrupt descriptors can be allocated from modules. This field is needed to provide a refcount on the module which provides the interrupts;
and fills the rest of the irq_desc fields. The desc_set_defaults function fills the interrupt number, irq chip, platform-specific per-chip private data for the chip methods, per-IRQ data for the irq_chip methods and the MSI descriptor for the per irq and irq chip data:
desc->irq_data.irq = irq;
desc->irq_data.chip = &no_irq_chip;
desc->irq_data.chip_data = NULL;
desc->irq_data.handler_data = NULL;
desc->irq_data.msi_desc = NULL;
...
...
...
The irq_data.chip structure provides a general API, such as irq_set_chip, irq_set_irq_type, etc., for the irq controller drivers. You can find it in the kernel/irq/chip.c source code file.
After this we set the status accessors for the given descriptor and set the disabled state of the interrupt:
...
...
...
irq_settings_clr_and_set(desc, ~0, _IRQ_DEFAULT_INIT_FLAGS);
irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED);
...
...
...
In the next step we set the high-level interrupt handler to handle_bad_irq, which handles spurious and unhandled irqs (as the hardware is not initialized yet, we set this handler), set irq_desc.depth to 1, which means that the IRQ is disabled, and reset the count of unhandled interrupts and of interrupts in general:
...
...
...
desc->handle_irq = handle_bad_irq;
desc->depth = 1;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
desc->name = NULL;
desc->owner = owner;
...
...
...
After this we go through all the possible processors with the for_each_possible_cpu helper and set kstat_irqs to zero for the given interrupt descriptor:
for_each_possible_cpu(cpu)
    *per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
and call the desc_smp_init function from kernel/irq/irqdesc.c, which initializes the NUMA node of the given interrupt descriptor, sets the default SMP affinity and, depending on the value of the CONFIG_GENERIC_PENDING_IRQ kernel configuration option, clears the pending_mask of the given interrupt descriptor:
static void desc_smp_init(struct irq_desc *desc, int node)
{
    desc->irq_data.node = node;
    cpumask_copy(desc->irq_data.affinity, irq_default_affinity);
#ifdef CONFIG_GENERIC_PENDING_IRQ
    cpumask_clear(desc->pending_mask);
#endif
}
At the end of the early_irq_init function we return the return value of the arch_early_irq_init function:
return arch_early_irq_init();
This function is defined in arch/x86/kernel/apic/vector.c and contains only one call, of the arch_early_ioapic_init function from arch/x86/kernel/apic/io_apic.c. As we can understand from its name, arch_early_ioapic_init performs early initialization of the I/O APIC. First of all it checks the number of legacy interrupts with a call of the nr_legacy_irqs function. If we have no legacy interrupts from the Intel 8259 programmable interrupt controller, we set io_apic_irqs to 0xffffffffffffffff:
if (!nr_legacy_irqs())
    io_apic_irqs = ~0UL;
After this we go through all the I/O APICs and allocate space for the registers with a call of alloc_ioapic_saved_registers:
for_each_ioapic(i)
    alloc_ioapic_saved_registers(i);
And at the end of the arch_early_ioapic_init function we go through all the legacy irqs (from IRQ0 to IRQ15) in a loop and allocate space for the irq_cfg, which represents the configuration of an irq on the given NUMA node:
for (i = 0; i < nr_legacy_irqs(); i++) {
    cfg = alloc_irq_and_cfg_at(i, node);
    cfg->vector = IRQ0_VECTOR + i;
    cpumask_setall(cfg->domain);
}
That's all.
Sparse IRQs
We already saw at the beginning of this part that the implementation of the early_irq_init function depends on the CONFIG_SPARSE_IRQ kernel configuration option. Previously we saw the implementation which is used when CONFIG_SPARSE_IRQ is not set; now let's look at its implementation when this option is set. The implementation is very similar, but differs a little. We can see the same definition of variables and the call of init_irq_default_affinity at the beginning of the early_irq_init function:
#ifdef CONFIG_SPARSE_IRQ
int __init early_irq_init(void)
{
    int i, initcnt, node = first_online_node;
    struct irq_desc *desc;
    init_irq_default_affinity();
    ...
    ...
    ...
}
#else
...
...
...
But after this we can see the following call:
initcnt = arch_probe_nr_irqs();
The arch_probe_nr_irqs function is defined in arch/x86/kernel/apic/vector.c. It calculates the count of pre-allocated irqs and updates nr_irqs with this number. But stop. Why are there pre-allocated irqs? There is an alternative form of interrupts called Message Signaled Interrupts, available in PCI. Instead of being assigned a fixed interrupt request number, the device is allowed to write a message to a particular memory address, which in fact delivers it to the Local APIC. MSI permits a device to allocate 1, 2, 4, 8, 16 or 32 interrupts and MSI-X permits a device to allocate up to 2048 interrupts. Now we know that irqs can be pre-allocated. More about MSI will be in a later part, but now let's look at the arch_probe_nr_irqs function. We can see the check which limits nr_irqs to the number of interrupt vectors for all processors in the system if it is greater, and the calculation of nr, which represents the number of MSI interrupts:
int nr_irqs = NR_IRQS;
if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
    nr_irqs = NR_VECTORS * nr_cpu_ids;
nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
Take a look at the gsi_top variable. Each APIC is identified by its own ID and by the offset where its IRQs start. This offset is called the GSI base, or Global System Interrupt base. So gsi_top represents it. We get the Global System Interrupt base from the MultiProcessor Configuration Table (you may remember that we parsed this table in the sixth part of the Linux kernel initialization process chapter).
After this we update nr depending on the value of gsi_top:
#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
if (gsi_top <= NR_IRQS_LEGACY)
    nr += 8 * nr_cpu_ids;
else
    nr += gsi_top * 16;
#endif
Update nr_irqs if nr is less than it and return the number of legacy irqs:
if (nr < nr_irqs)
    nr_irqs = nr;
return nr_legacy_irqs();
}
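The calculation in arch_probe_nr_irqs can be replayed in user space by turning the kernel globals into parameters. This is a sketch under assumptions: model_probe_nr_irqs is a hypothetical function, the MSI branch is always taken (as if CONFIG_PCI_MSI were set), and it returns the resulting nr_irqs rather than the number of legacy irqs.

```c
#include <assert.h>

#define MODEL_NR_VECTORS     256
#define MODEL_NR_IRQS_LEGACY 16

/* Replays the arithmetic above and returns the final nr_irqs value. */
static int model_probe_nr_irqs(int nr_irqs, int nr_cpu_ids,
                               int gsi_top, int nr_legacy_irqs)
{
    int nr;

    assert(nr_cpu_ids > 0 && gsi_top >= 0 && nr_legacy_irqs >= 0);

    if (nr_irqs > MODEL_NR_VECTORS * nr_cpu_ids)
        nr_irqs = MODEL_NR_VECTORS * nr_cpu_ids;

    nr = (gsi_top + nr_legacy_irqs) + 8 * nr_cpu_ids;

    /* Assumption: CONFIG_PCI_MSI or CONFIG_HT_IRQ is enabled. */
    if (gsi_top <= MODEL_NR_IRQS_LEGACY)
        nr += 8 * nr_cpu_ids;
    else
        nr += gsi_top * 16;

    if (nr < nr_irqs)
        nr_irqs = nr;

    return nr_irqs;
}
```

With NR_IRQS equal to 4352, eight CPUs, 16 legacy irqs and an assumed gsi_top of 24, the function yields 488, which agrees with the nr_irqs value in the dmesg output shown in this part. The gsi_top value is a guess that fits that output, not something stated in the text.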
The next step after arch_probe_nr_irqs is printing information about the number of IRQs:
printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d %d\n", NR_IRQS, nr_irqs, initcnt);
We can find it in the dmesg output:
$ dmesg | grep NR_IRQS
[    0.000000] NR_IRQS:4352 nr_irqs:488 16
After this we check that the nr_irqs and initcnt values are not greater than the maximum allowable number of irqs:
if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
    nr_irqs = IRQ_BITMAP_BITS;
if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
    initcnt = IRQ_BITMAP_BITS;
where IRQ_BITMAP_BITS is equal to NR_IRQS if CONFIG_SPARSE_IRQ is not set and NR_IRQS + 8196 otherwise. In the next step we go over all the interrupt descriptors which need to be allocated in a loop, allocate space for each descriptor and insert it into the irq_desc_tree radix tree:
for (i = 0; i < initcnt; i++) {
    desc = alloc_desc(i, node, NULL);
    set_bit(i, allocated_irqs);
    irq_insert_desc(i, desc);
}
At the end of the early_irq_init function we return the value of the call of the arch_early_irq_init function, as we already did in the previous variant when the CONFIG_SPARSE_IRQ option was not set:
return arch_early_irq_init();
That's all.
Conclusion
This is the end of the seventh part of the Interrupts and Interrupt Handling chapter, and in this part we started to dive into external hardware interrupts. We saw the early initialization of the irq_desc structure, which represents the description of an external interrupt and contains information about it such as the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts, etc. In the next part we will continue to explore external interrupts.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.
Links
IRQ
numa
Enum type
cpumask
percpu
spinlock
critical section
Lock validator
MSI
I/O APIC
Local APIC
Intel 8259
PIC
MultiProcessor Configuration Table
radix tree
dmesg
Interrupts and Interrupt Handling. Part 8.
Non-early initialization of the IRQs
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel chapter, and in the previous part we started to dive into external hardware interrupts. We looked at the implementation of the early_irq_init function from the kernel/irq/irqdesc.c source code file and saw the initialization of the irq_desc structure in this function. Recall that the irq_desc structure (defined in include/linux/irqdesc.h) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization code which is related to external hardware interrupts.
Right after the call of the early_irq_init function in init/main.c we can see the call of the init_IRQ function. This function is architecture-specific and defined in arch/x86/kernel/irqinit.c. The init_IRQ function initializes the vector_irq percpu variable, which is defined in the same arch/x86/kernel/irqinit.c source code file:
...
DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
    [0 ... NR_VECTORS - 1] = -1,
};
...
and represents a percpu array of interrupt vector numbers. vector_irq_t is defined in arch/x86/include/asm/hw_irq.h and expands to:
typedef int vector_irq_t[NR_VECTORS];
where NR_VECTORS is the count of vector numbers and, as you may remember from the first part of this chapter, it is 256 for x86_64:
#define NR_VECTORS 256
So, at the start of the init_IRQ function we fill the vector_irq percpu array with the vector numbers of the legacy interrupts:
void __init init_IRQ(void)
{
    int i;
    for (i = 0; i < nr_legacy_irqs(); i++)
        per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
    ...
    ...
    ...
}
This vector_irq will be used during the first steps of external hardware interrupt handling in the do_IRQ function from arch/x86/kernel/irq.c:
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
    ...
    ...
    ...
    irq = __this_cpu_read(vector_irq[vector]);
    if (!handle_irq(irq, regs)) {
        ...
        ...
        ...
    }
    exiting_irq();
    ...
    ...
    return 1;
}
Why legacy here? Actually all interrupts are handled by the modern IO-APIC controller. But these interrupts (from 0x30 to 0x3f) are handled by legacy interrupt controllers like the Programmable Interrupt Controller. If these interrupts are handled by the I/O APIC, then this vector space will be freed and re-used. Let's look at this code more closely. First of all, nr_legacy_irqs is defined in arch/x86/include/asm/i8259.h and just returns the nr_legacy_irqs field from the legacy_pic structure:
static inline int nr_legacy_irqs(void)
{
    return legacy_pic->nr_legacy_irqs;
}
This structure is defined in the same header file and represents a non-modern programmable interrupt controller:
struct legacy_pic {
    int nr_legacy_irqs;
    struct irq_chip *chip;
    void (*mask)(unsigned int irq);
    void (*unmask)(unsigned int irq);
    void (*mask_all)(void);
    void (*restore_mask)(void);
    void (*init)(int auto_eoi);
    int (*irq_pending)(unsigned int irq);
    void (*make_irq)(unsigned int irq);
};
The default maximum number of legacy interrupts is represented by the NR_IRQS_LEGACY macro from arch/x86/include/asm/irq_vectors.h:
#define NR_IRQS_LEGACY 16
In the loop we access the vector_irq per-cpu array with the per_cpu macro by the IRQ0_VECTOR + i index and write the legacy irq number there. The IRQ0_VECTOR macro is defined in the arch/x86/include/asm/irq_vectors.h header file and expands to 0x30:
#define FIRST_EXTERNAL_VECTOR 0x20
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
Why 0x30 here? You may remember from the first part of this chapter that the first 32 vector numbers, from 0 to 31, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from 0x30 to 0x3f are reserved for the ISA. So, it means that we fill vector_irq starting from IRQ0_VECTOR, which is equal to 0x30 (48), up to but not including IRQ0_VECTOR + 16, that is, vectors 0x30 to 0x3f.
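The value of IRQ0_VECTOR and the vector-to-irq mapping built by init_IRQ can be verified with a short user-space model. All MODEL_ and model_ names are hypothetical, and a plain global array stands in for the per-cpu vector_irq array.

```c
#include <assert.h>

#define MODEL_NR_VECTORS            256
#define MODEL_FIRST_EXTERNAL_VECTOR 0x20
/* Round (0x20 + 16) down to a multiple of 16, which gives 0x30. */
#define MODEL_IRQ0_VECTOR ((MODEL_FIRST_EXTERNAL_VECTOR + 16) & ~15)

static int model_vector_irq[MODEL_NR_VECTORS];

/* Mark every vector unused (-1), then map the legacy vectors to irq 0..n-1. */
static void model_init_vector_irq(int nr_legacy_irqs)
{
    int i;

    assert(nr_legacy_irqs >= 0 &&
           MODEL_IRQ0_VECTOR + nr_legacy_irqs <= MODEL_NR_VECTORS);

    for (i = 0; i < MODEL_NR_VECTORS; i++)
        model_vector_irq[i] = -1;
    for (i = 0; i < nr_legacy_irqs; i++)
        model_vector_irq[MODEL_IRQ0_VECTOR + i] = i;
}

/* The first step of do_IRQ: translate a vector into an irq number. */
static int model_vector_to_irq(unsigned int vector)
{
    assert(vector < MODEL_NR_VECTORS);
    return model_vector_irq[vector];
}
```

After initialization with 16 legacy irqs, vector 0x30 maps to irq 0 and vector 0x3f to irq 15, while every other vector still holds -1.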
At the end of the init_IRQ function we can see the call of the following function:
x86_init.irqs.intr_init();
from the arch/x86/kernel/x86_init.c source code file. If you have read the chapter about the Linux kernel initialization process, you may remember the x86_init structure. This structure contains a couple of fields which point to functions related to the platform setup (x86_64 in our case), for example resources - related to the memory resources, mpparse - related to the parsing of the MultiProcessor Configuration Table, etc. As we can see, x86_init also contains the irqs field, which contains the three following fields:
struct x86_init_ops x86_init __initdata =
{
    ...
    ...
    ...
    .irqs = {
        .pre_vector_init = init_ISA_irqs,
        .intr_init = native_init_IRQ,
        .trap_init = x86_init_noop,
    },
    ...
    ...
    ...
}
Now we are interested in native_init_IRQ. As we can note, the name of the native_init_IRQ function contains the native_ prefix, which means that this function is architecture-specific. It is defined in arch/x86/kernel/irqinit.c and executes general initialization of the Local APIC and initialization of the ISA irqs. Let's look at the implementation of the native_init_IRQ function and try to understand what happens there. The native_init_IRQ function starts with the execution of the following function:
x86_init.irqs.pre_vector_init();
As we can see above, pre_vector_init points to the init_ISA_irqs function, which is defined in the same source code file and, as we can understand from its name, initializes the ISA-related interrupts. The init_ISA_irqs function starts with the definition of the chip variable, which has the irq_chip type:
void __init init_ISA_irqs(void)
{
    struct irq_chip *chip = legacy_pic->chip;
    ...
    ...
    ...
The irq_chip structure is defined in the include/linux/irq.h header file and represents a hardware interrupt chip descriptor. It contains:
name - name of a device. Used in /proc/interrupts (look at the last column):
$ cat /proc/interrupts
     CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
  0:   16    0    0    0    0    0    0    0   IO-APIC   2-edge   timer
  1:    2    0    0    0    0    0    0    0   IO-APIC   1-edge   i8042
  8:    1    0    0    0    0    0    0    0   IO-APIC   8-edge   rtc0
(*irq_mask)(struct irq_data *data) - mask an interrupt source;
(*irq_ack)(struct irq_data *data) - start of a new interrupt;
(*irq_startup)(struct irq_data *data) - start up the interrupt;
(*irq_shutdown)(struct irq_data *data) - shut down the interrupt;
and other fields. Note that the irq_data structure represents the set of per irq chip data passed down to chip functions. It contains mask - precomputed bitmask for accessing the chip registers, irq - interrupt number, hwirq - hardware interrupt number, local to the interrupt domain, chip - low-level interrupt hardware access, etc.
After this, depending on the CONFIG_X86_64 and CONFIG_X86_LOCAL_APIC kernel configuration options, we call the init_bsp_APIC function from arch/x86/kernel/apic/apic.c:
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
    init_bsp_APIC();
#endif
This function initializes the APIC of the bootstrap processor (the processor which starts first). It begins by checking whether an SMP config was found (read more about it in the sixth part of the Linux kernel initialization process chapter) and whether the processor has an APIC:
if (smp_found_config || !cpu_has_apic)
    return;
If an SMP config was found or the processor has no APIC, we return from this function. Otherwise, in the next step we call the clear_local_APIC function from the same source code file, which shuts down the local APIC (more about it will be in the chapter about the Advanced Programmable Interrupt Controller), and then enable the APIC of the first processor by setting the APIC_SPIV_APIC_ENABLED bit in the unsigned int value:
value = apic_read(APIC_SPIV);
value &= ~APIC_VECTOR_MASK;
value |= APIC_SPIV_APIC_ENABLED;
and writing it with the help of the apic_write function:
apic_write(APIC_SPIV, value);
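The read-modify-write sequence above can be tried in isolation. The following sketch works on a plain integer instead of a real APIC register; the two constant values are what I recall from arch/x86/include/asm/apicdef.h and should be treated as assumptions, and spiv_enable is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; check them against apicdef.h in your kernel tree. */
#define APIC_VECTOR_MASK        0x000FFu
#define APIC_SPIV_APIC_ENABLED  (1u << 8)

/* Clears the vector field and sets the software-enable bit. */
static uint32_t spiv_enable(uint32_t value)
{
    value &= ~APIC_VECTOR_MASK;
    value |= APIC_SPIV_APIC_ENABLED;
    assert((value & APIC_VECTOR_MASK) == 0);
    return value;
}
```

Under these assumptions, a register that holds only a vector (0x000000FF) becomes 0x00000100: the low eight bits are cleared and bit 8 is set, while all higher bits are left untouched.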
After we have enabled the APIC for the bootstrap processor, we return to the init_ISA_irqs function, and in the next step we initialize the legacy Programmable Interrupt Controller and set the legacy chip and handler for each legacy irq:
legacy_pic->init(0);
for (i = 0; i < nr_legacy_irqs(); i++)
    irq_set_chip_and_handler(i, chip, handle_level_irq);
Where can we find the init function? legacy_pic is defined in arch/x86/kernel/i8259.c, and it is:
struct legacy_pic *legacy_pic = &default_legacy_pic;
Where default_legacy_pic is:
struct legacy_pic default_legacy_pic = {
    ...
    ...
    ...
    .init = init_8259A,
    ...
    ...
    ...
}
The init_8259A function is defined in the same source code file and performs initialization of the Intel 8259 Programmable Interrupt Controller (more about it will be in a separate chapter about Programmable Interrupt Controllers and APIC).
Now we can return to the native_init_IRQ function, after the init_ISA_irqs function has finished its work. The next step is the call of the apic_intr_init function, which allocates special interrupt gates that are used by the SMP architecture for Inter-processor interrupts. The alloc_intr_gate macro from arch/x86/include/asm/desc.h is used for the interrupt descriptor allocation:
#define alloc_intr_gate(n, addr)        \
    do {                                \
        alloc_system_vector(n);         \
        set_intr_gate(n, addr);         \
    } while (0)
As we can see, first of all it expands to the call of the alloc_system_vector function, which checks the given vector number in the used_vectors bitmap (read the previous part about it). If the bit is not yet set in the used_vectors bitmap, we set it. After this we test whether first_system_vector is greater than the given interrupt vector number, and if it is, we assign the vector number to it. If the bit is already set, the vector is already taken and BUG() is called:
if (!test_bit(vector, used_vectors)) {
    set_bit(vector, used_vectors);
    if (first_system_vector > vector)
        first_system_vector = vector;
} else {
    BUG();
}
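The bookkeeping done by alloc_system_vector can be modelled in userspace. This is only a sketch: the demo_ helpers are hypothetical non-atomic replacements for the kernel bit operations, and the function returns false where the kernel would call BUG().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS    256
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS  ((NR_VECTORS + BITS_PER_LONG - 1) / BITS_PER_LONG)

static unsigned long used_vectors[BITMAP_LONGS];
static int first_system_vector = NR_VECTORS;

static bool demo_test_bit(int nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

static void demo_set_bit(int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Returns false where the kernel would call BUG(). */
static bool demo_alloc_system_vector(int vector)
{
    assert(vector >= 0 && vector < NR_VECTORS);
    if (demo_test_bit(vector, used_vectors))
        return false;
    demo_set_bit(vector, used_vectors);
    if (first_system_vector > vector)
        first_system_vector = vector;   /* track the lowest system vector */
    return true;
}
```

After allocating, say, vectors 0xf0 and 0xef in this model, first_system_vector ends up at 0xef, the lowest vector handed out so far, and a second request for 0xf0 is rejected.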
We already saw the set_bit macro, so now let's look at test_bit and first_system_vector. The test_bit macro is defined in arch/x86/include/asm/bitops.h and looks like this:
#define test_bit(nr, addr)                  \
    (__builtin_constant_p((nr))             \
     ? constant_test_bit((nr), (addr))      \
     : variable_test_bit((nr), (addr)))
Here the ternary operator uses the gcc built-in function __builtin_constant_p to test whether the given vector number (nr) is known at compile time. If __builtin_constant_p is unclear to you, we can make a simple test:
#include <stdio.h>

#define PREDEFINED_VAL 1

int main() {
    int i = 5;
    printf("__builtin_constant_p(i) is %d\n", __builtin_constant_p(i));
    printf("__builtin_constant_p(PREDEFINED_VAL) is %d\n", __builtin_constant_p(PREDEFINED_VAL));
    printf("__builtin_constant_p(100) is %d\n", __builtin_constant_p(100));

    return 0;
}
and look at the result:
$ gcc test.c -o test
$ ./test
__builtin_constant_p(i) is 0
__builtin_constant_p(PREDEFINED_VAL) is 1
__builtin_constant_p(100) is 1
Now I think it should be clear. Let's get back to the test_bit macro. If __builtin_constant_p returns non-zero, we call the constant_test_bit function:
static inline int constant_test_bit(int nr, const void *addr)
{
    const u32 *p = (const u32 *)addr;

    return ((1UL << (nr & 31)) & (p[nr >> 5])) != 0;
}
and the variable_test_bit function otherwise:
static inline int variable_test_bit(int nr, const void *addr)
{
    u8 v;
    const u32 *p = (const u32 *)addr;

    asm("btl %2,%1; setc %0" : "=qm" (v) : "m" (*p), "Ir" (nr));
    return v;
}
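Both variants and the dispatching macro can be reproduced in userspace. In this sketch the demo_ names are hypothetical, and the variable version is written in portable C as a stand-in for the btl/setc inline assembly; it assumes a compiler that provides __builtin_constant_p, such as gcc or clang.

```c
#include <assert.h>
#include <stdint.h>

static inline int demo_constant_test_bit(int nr, const void *addr)
{
    const uint32_t *p = (const uint32_t *)addr;

    /* nr >> 5 selects the 32-bit word, nr & 31 the bit inside it. */
    return ((UINT32_C(1) << (nr & 31)) & p[nr >> 5]) != 0;
}

/* Portable stand-in for the inline-assembly version. */
static inline int demo_variable_test_bit(int nr, const void *addr)
{
    const uint32_t *p = (const uint32_t *)addr;

    return (p[nr >> 5] >> (nr & 31)) & 1u;
}

#define demo_test_bit(nr, addr)                     \
    (__builtin_constant_p((nr))                     \
     ? demo_constant_test_bit((nr), (addr))         \
     : demo_variable_test_bit((nr), (addr)))
```

For example, bit number 37 lives in word 37 >> 5 = 1 at position 37 & 31 = 5, so it corresponds to the value 0x20 in the second 32-bit word of the bitmap.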
What's the difference between these two functions, and why do we need two different functions for the same purpose? As you can already guess, the main purpose is optimization. If we write a simple example with these functions:
#define CONST 25

int main() {
    int nr = 24;
    variable_test_bit(nr, (int*)0x10000000);
    constant_test_bit(CONST, (int*)0x10000000);
    return 0;
}
and look at the assembly output of our example, we will see the following assembly code:
pushq   %rbp
movq    %rsp, %rbp

movl    $268435456, %esi
movl    $25, %edi
call    constant_test_bit
for constant_test_bit, and:
pushq   %rbp
movq    %rsp, %rbp

subq    $16, %rsp
movl    $24, -4(%rbp)
movl    -4(%rbp), %eax
movl    $268435456, %esi
movl    %eax, %edi
call    variable_test_bit
for variable_test_bit. These two code listings start with the same part: first of all we set up the base of the current stack frame in the %rbp register. After this, the code for the two examples differs. In the first example we put $268435456 (our second parameter, 0x10000000) into esi and $25 (our first parameter) into the edi register and call constant_test_bit. We put the function parameters into the esi and edi registers because, as we are learning the Linux kernel for the x86_64 architecture, we use the System V AMD64 ABI calling convention. All is pretty simple: when we use a predefined constant, the compiler can just substitute its value. Now let's look at the second part. As you can see here, the compiler cannot substitute the value of the nr variable. In this case the compiler must calculate its offset in the program's stack frame. We subtract 16 from the rsp register to allocate stack space for the local variables and put $24 (the value of the nr variable) at offset -4 from rbp. Our stack frame will look like this:
         <- stack grows

              %[rbp]
                 |
+----------+ +---------+ +---------+ +--------+
|          | |         | | return  | |        |
|    nr    |-|         |-|         |-|  argc  |
|          | |         | | address | |        |
+----------+ +---------+ +---------+ +--------+
                 |
              %[rsp]
After this we put this value into eax, so the eax register now contains the value of nr. In the end we do the same as in the first example: we put $268435456 (the second parameter of the variable_test_bit function) into esi and the value of eax (the value of nr, the first parameter of the variable_test_bit function) into the edi register.
The next step, after the apic_intr_init function has finished its work, is setting the interrupt gates from FIRST_EXTERNAL_VECTOR (0x20) up to first_system_vector, which is NR_VECTORS (256) when CONFIG_X86_LOCAL_APIC is not set:
i = FIRST_EXTERNAL_VECTOR;

#ifndef CONFIG_X86_LOCAL_APIC
#define first_system_vector NR_VECTORS
#endif

for_each_clear_bit_from(i, used_vectors, first_system_vector) {
    set_intr_gate(i, irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR));
}
But as we are using the for_each_clear_bit_from helper, we set only the interrupt gates that are not initialized yet. After this we use the same for_each_clear_bit_from helper to fill the remaining unfilled interrupt gates in the interrupt table with spurious_interrupt:
#ifdef CONFIG_X86_LOCAL_APIC
for_each_clear_bit_from(i, used_vectors, NR_VECTORS)
    set_intr_gate(i, spurious_interrupt);
#endif
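The behaviour of the for_each_clear_bit_from helper, visiting only the bits that are still clear and starting from the current value of the iterator, can be sketched as follows. The demo_ names are hypothetical and the linear search is chosen for clarity, not speed; the kernel helper is implemented differently.

```c
#include <assert.h>
#include <limits.h>

#define NR_VECTORS     256
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

/* Returns the index of the first clear bit at or after 'from', or 'size' if none. */
static unsigned int demo_next_clear_bit(const unsigned long *map,
                                        unsigned int size, unsigned int from)
{
    assert(from <= size);
    for (unsigned int i = from; i < size; i++)
        if (!((map[i / BITS_PER_LONG] >> (i % BITS_PER_LONG)) & 1UL))
            return i;
    return size;
}

/* Hypothetical userspace counterpart of for_each_clear_bit_from(). */
#define demo_for_each_clear_bit_from(bit, map, size)             \
    for ((bit) = demo_next_clear_bit((map), (size), (bit));      \
         (bit) < (size);                                         \
         (bit) = demo_next_clear_bit((map), (size), (bit) + 1))

/* Counts the vectors in [from, limit) that still have no gate installed. */
static unsigned int demo_count_free(const unsigned long *map,
                                    unsigned int from, unsigned int limit)
{
    unsigned int bit = from, n = 0;

    demo_for_each_clear_bit_from(bit, map, limit)
        n++;
    return n;
}
```

Note that the iterator is both input and output: like the kernel helper, the loop continues from whatever value it already holds, which is why the second loop in native_init_IRQ resumes where the first one stopped.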
Where the spurious_interrupt function represents the interrupt handler for the spurious interrupt. Here used_vectors is the unsigned long bitmap that records the already initialized interrupt gates. We already filled the first 32 interrupt vectors in the trap_init function from the arch/x86/kernel/traps.c source code file:
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);
You may remember how we did it in the sixth part of this chapter.
At the end of the native_init_IRQ function we can see the following check:
if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs())
    setup_irq(2, &irq2);
First of all let's deal with the condition. The acpi_ioapic variable represents the existence of an I/O APIC. It is defined in arch/x86/kernel/acpi/boot.c. This variable is set in the acpi_set_irq_model_ioapic function, which is called during the processing of the Multiple APIC Description Table. This occurs during the initialization of the architecture-specific stuff in arch/x86/kernel/setup.c (we will learn more about it in the other chapter about APIC). Note that the value of the acpi_ioapic variable depends on the CONFIG_ACPI and CONFIG_X86_LOCAL_APIC Linux kernel configuration options. If these options are not set, this variable will be just zero:
#define acpi_ioapic 0
The rest of the condition, !of_ioapic && nr_legacy_irqs(), checks that we do not use an Open Firmware I/O APIC and that there is a legacy interrupt controller. We already know about nr_legacy_irqs. The of_ioapic variable is defined in arch/x86/kernel/devicetree.c and initialized in the dtb_ioapic_setup function, which builds information about APICs from the devicetree. Note that the of_ioapic variable depends on the CONFIG_OF Linux kernel configuration option. If this option is not set, the value of of_ioapic will be zero too:
#ifdef CONFIG_OF
extern int of_ioapic;
...
...
...
#else
#define of_ioapic 0
...
...
...
#endif
If the condition evaluates to a non-zero value, we call the:
setup_irq(2, &irq2);
function. First of all, about irq2. irq2 is the irqaction structure that is defined in the arch/x86/kernel/irqinit.c source code file and represents the IRQ 2 line, which is used to query devices connected in cascade:
static struct irqaction irq2 = {
    .handler = no_action,
    .name = "cascade",
    .flags = IRQF_NO_THREAD,
};
Some time ago the interrupt controller consisted of two chips, with one connected to the other. The second chip was connected to the first chip via this IRQ 2 line. The second chip serviced lines from 8 to 15, and after them come the remaining lines of the first chip. So, for example, the Intel 8259A has the following lines:
* IRQ 0 - system time;
* IRQ 1 - keyboard;
* IRQ 2 - used for devices which are cascade connected;
* IRQ 8 - RTC;
* IRQ 9 - reserved;
* IRQ 10 - reserved;
* IRQ 11 - reserved;
* IRQ 12 - ps/2 mouse;
* IRQ 13 - coprocessor;
* IRQ 14 - hard drive controller;
* IRQ 15 - reserved;
* IRQ 3 - COM2 and COM4;
* IRQ 4 - COM1 and COM3;
* IRQ 5 - LPT2;
* IRQ 6 - drive controller;
* IRQ 7 - LPT1.
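The cascade layout described above can be expressed as a small mapping from an IRQ number to the chip and input pin that serve it. This is only an illustrative sketch with hypothetical demo_ names; it is not how the kernel represents the 8259A pair.

```c
#include <assert.h>
#include <stdbool.h>

#define CASCADE_IRQ 2   /* input of the first chip that the second chip is wired to */

struct demo_pic_line {
    bool on_slave;      /* true when the line belongs to the second chip */
    unsigned int pin;   /* input pin (0..7) on that chip */
};

static struct demo_pic_line demo_locate_irq(unsigned int irq)
{
    struct demo_pic_line line;

    assert(irq < 16);            /* two chips, eight inputs each */
    line.on_slave = irq >= 8;    /* lines 8..15 are serviced by the second chip */
    line.pin = irq & 7;
    return line;
}
```

In this model IRQ 12 (the PS/2 mouse) maps to pin 4 of the second chip, whose requests reach the processor through the cascade input of the first chip.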
The setup_irq function is defined in kernel/irq/manage.c and takes two parameters:
* vector number of an interrupt;
* irqaction structure related to the interrupt.

This function first obtains the interrupt descriptor for the given vector number:
struct irq_desc *desc = irq_to_desc(irq);
and calls the __setup_irq function, which sets up the given interrupt:
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, act);
chip_bus_sync_unlock(desc);
return retval;
Note that the interrupt descriptor is locked while the __setup_irq function does its work. The __setup_irq function does many different things: it creates a handler thread when a thread function is supplied and the interrupt does not nest into another interrupt thread, sets the flags of the chip, fills the irqaction structure and much more.
In addition to all of the above, it creates the /proc/irq/vector_number directory and fills it, but if you are using a modern computer all values will be zero there:
$ cat /proc/irq/2/node
0

$ cat /proc/irq/2/affinity_hint
00

$ cat /proc/irq/2/spurious
count 0
unhandled 0
last_unhandled 0 ms
because the APIC probably handles interrupts on our machine.
That's all.
Conclusion
This is the end of the eighth part of the Interrupts and Interrupt Handling chapter, and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw the early initialization of the IRQs. In this part we saw the non-early interrupts initialization in the init_IRQ function. We saw the initialization of the vector_irq per-cpu array, which stores the vector numbers of the interrupts and will be used during interrupt handling, and the initialization of other stuff related to the external hardware interrupts.
In the next part we will continue to learn about interrupt handling and will see the initialization of the softirqs.
If you have any questions or suggestions, write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
* IRQ
* percpu
* x86_64
* Intel 8259
* Programmable Interrupt Controller
* ISA
* MultiProcessor Configuration Table
* Local APIC
* I/O APIC
* SMP
* Inter-processor interrupt
* ternary operator
* gcc
* calling convention
* PDF. System V Application Binary Interface AMD64
* Call stack
* Open Firmware
* devicetree
* RTC
* Previous part
Interrupts and Interrupt Handling. Part 9.
Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
This is the ninth part of the Interrupts and Interrupt Handling chapter of the linux-insides book, and in the previous part we saw the implementation of the init_IRQ function, which is defined in the arch/x86/kernel/irqinit.c source code file. So, we will continue to dive into the initialization stuff related to the external hardware interrupts in this part.
After the init_IRQ function we can see the call of the softirq_init function in init/main.c. This function is defined in the kernel/softirq.c source code file and, as we can understand from its name, it initializes softirqs, or in other words deferred interrupts. What is a deferred interrupt? We already saw a little bit about it in the ninth part of the chapter that describes the initialization process of the Linux kernel. There are three types of deferred interrupts in the Linux kernel:
* softirqs;
* tasklets;
* workqueues;

And we will see a description of all of these types in this part. As I said, we saw only a little bit about this theme, so now is the time to dive deep into the details.
Deferred interrupts
Interrupts may have different important characteristics, and there are two among them:
* The handler of an interrupt must execute quickly;
* Sometimes an interrupt handler must do a large amount of work.

As you can understand, it is almost impossible to satisfy both characteristics at once. Because of this, the handling of interrupts was previously split into two parts:
* Top half;
* Bottom half;

Once there was one way of organizing deferred processing in the Linux kernel, which was called the bottom half of the processor, but it is no longer in use. The term has remained as a common noun referring to all the different ways of organizing deferred processing of an interrupt. With the advent of parallelism in the Linux kernel, all new schemes of implementing bottom half handlers are built on a processor-specific kernel thread called ksoftirqd (it will be discussed below). The softirq mechanism handles work that is almost as important as the handling of the hardware interrupts. Deferred processing of an interrupt means that some of the actions for an interrupt may be postponed to a later time, when the system is less loaded. As you may guess, an interrupt handler may have to do a large amount of work, which is impermissible as it executes in a context where interrupts are disabled. That's why the processing of an interrupt can be split into two different parts. In the first part, the main handler of the interrupt does only the minimal and most important job. After this it schedules the second part and finishes its work. When the system is less busy and the context of the processor allows handling interrupts, the second part starts its work and finishes processing the remaining part of the deferred interrupt. That is the main explanation of deferred interrupt handling.
As I already wrote above, handling of deferred interrupts (or softirqs, in other words) and accordingly tasklets is performed by a set of special kernel threads (one thread per processor). Each processor has its own thread that is called ksoftirqd/n, where n is the number of the processor. We can see it in the output of the systemd-cgls util:
$ systemd-cgls -k | grep ksoft
├─   3 [ksoftirqd/0]
├─  13 [ksoftirqd/1]
├─  18 [ksoftirqd/2]
├─  23 [ksoftirqd/3]
├─  28 [ksoftirqd/4]
├─  33 [ksoftirqd/5]
├─  38 [ksoftirqd/6]
├─  43 [ksoftirqd/7]
The spawn_ksoftirqd function starts these threads. As we can see, this function is called as an early initcall:
early_initcall(spawn_ksoftirqd);
Deferred interrupts are determined statically at compile time of the Linux kernel, and the open_softirq function takes care of softirq initialization. The open_softirq function is defined in kernel/softirq.c:
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}
and as we can see, this function takes two parameters:
* the index of the softirq_vec array;
* a pointer to the softirq function to be executed;

First of all let's look at the softirq_vec array:
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
It is defined in the same source code file. As we can see, the softirq_vec array may contain NR_SOFTIRQS, or 10, types of softirqs, each of which has the softirq_action type. First of all, about its elements. In the current version of the Linux kernel there are ten softirq vectors defined: two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. All of these kinds are represented by the following enum:
enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    NR_SOFTIRQS
};
All names of these kinds of softirqs are represented by the following array:
const char * const softirq_to_name[NR_SOFTIRQS] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
    "TASKLET", "SCHED", "HRTIMER", "RCU"
};
Or we can see it in the output of /proc/softirqs:
~$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:          5          0          0          0          0          0          0          0
       TIMER:     332519     310498     289555     272913     282535     279467     282895     270979
      NET_TX:       2320          0          0          2          1          1          0          0
      NET_RX:     270221        225        338        281        311        262        430        265
       BLOCK:     134282         32         40         10         12          7          8          8
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:     196835          2          3          0          0          0          0          0
       SCHED:     161852     146745     129539     126064     127998     128014     120243     117391
     HRTIMER:          0          0          0          0          0          0          0          0
         RCU:     337707     289397     251874     239796     254377     254898     267497     256624
As we can see, the softirq_vec array has elements of the softirq_action type. This is the main data structure related to the softirq mechanism, so all softirqs are represented by the softirq_action structure. The softirq_action structure consists of a single field only: an action pointer to the softirq function:
struct softirq_action
{
    void (*action)(struct softirq_action *);
};
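The registration step is small enough to model completely in userspace. The sketch below uses a shortened, hypothetical enum and demo_ names; only the shape of the table and of the registration function follows the kernel code shown above.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_HI, DEMO_TIMER, DEMO_TASKLET, DEMO_NR_SOFTIRQS };

struct demo_softirq_action {
    void (*action)(struct demo_softirq_action *);
};

static struct demo_softirq_action demo_softirq_vec[DEMO_NR_SOFTIRQS];
static unsigned int demo_runs;   /* counts handler invocations */

static void demo_handler(struct demo_softirq_action *h)
{
    (void)h;
    demo_runs++;
}

/* Mirrors open_softirq(): store the handler in the slot for 'nr'. */
static void demo_open_softirq(int nr, void (*action)(struct demo_softirq_action *))
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_softirq_vec[nr].action = action;
}
```

Because the table has a fixed size known at compile time, a new kind of softirq cannot be added at runtime, which is the limitation that tasklets address.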
So, after this we can understand that the open_softirq function fills the softirq_vec array with the given softirq_action. For a registered deferred interrupt (registered with the call of the open_softirq function) to be queued for execution, it should be activated by the call of the raise_softirq function. This function takes only one parameter, a softirq index nr. Let's look at its implementation:
void raise_softirq(unsigned int nr)
{
    unsigned long flags;

    local_irq_save(flags);
    raise_softirq_irqoff(nr);
    local_irq_restore(flags);
}
Here we can see the call of the raise_softirq_irqoff function between the local_irq_save and local_irq_restore macros. local_irq_save is defined in the include/linux/irqflags.h header file; it saves the state of the IF flag of the eflags register and disables interrupts on the local processor. The local_irq_restore macro is defined in the same header file and does the opposite thing: it restores the saved interrupt flag, which re-enables interrupts if they were enabled before. We disable interrupts here because a softirq runs in the interrupt context and only that one softirq (and no others) will be run.
The raise_softirq_irqoff function marks the softirq as deferred by setting the bit corresponding to the given index nr in the softirq bit mask (__softirq_pending) of the local processor. It does this with the help of the:
__raise_softirq_irqoff(nr);
macro. After this, it checks the result of in_interrupt, which returns the irq_count value. We already saw irq_count in the first part of this chapter; it is used to check whether a CPU is already on an interrupt stack or not. If we are in the interrupt context, we just exit from raise_softirq_irqoff, restore the IF flag and enable interrupts on the local processor; otherwise we call wakeup_softirqd:
if (!in_interrupt())
    wakeup_softirqd();
Where the wakeup_softirqd function activates the ksoftirqd kernel thread of the local processor:
static void wakeup_softirqd(void)
{
    struct task_struct *tsk = __this_cpu_read(ksoftirqd);

    if (tsk && tsk->state != TASK_RUNNING)
        wake_up_process(tsk);
}
Each ksoftirqd kernel thread runs the run_ksoftirqd function, which checks for the existence of deferred interrupts and calls the __do_softirq function depending on the result. This function reads the __softirq_pending softirq bit mask of the local processor and executes the deferrable functions corresponding to every bit set. During the execution of a deferred function, new pending softirqs might occur. The main problem here is that the execution of userspace code can be delayed for a long time while the __do_softirq function handles deferred interrupts. For this purpose, it has a limit on the time by which it must be finished:
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
...
...
...
restart:
while ((softirq_bit = ffs(pending))) {
    ...
    h->action(h);
    ...
}
...
...
...
pending = local_softirq_pending();
if (pending) {
    if (time_before(jiffies, end) && !need_resched() &&
        --max_restart)
        goto restart;
}
...
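The restart logic is easier to follow in a reduced model. The sketch below keeps only the pending mask and the max_restart counter; the time limit and the need_resched check are left out, the bit is cleared instead of shifting the mask as the kernel does, and all demo_ names (including the hand-written demo_ffs) are hypothetical.

```c
#include <assert.h>

#define DEMO_NR_SOFTIRQS 10
#define DEMO_MAX_RESTART 10

static unsigned int demo_pending;                 /* models __softirq_pending */
static unsigned int demo_count[DEMO_NR_SOFTIRQS]; /* runs per softirq */

/* Returns the 1-based position of the lowest set bit, or 0 if x is 0. */
static int demo_ffs(unsigned int x)
{
    int pos = 1;

    if (!x)
        return 0;
    while (!(x & 1u)) {
        x >>= 1;
        pos++;
    }
    return pos;
}

static void demo_raise(unsigned int nr)
{
    assert(nr < DEMO_NR_SOFTIRQS);
    demo_pending |= 1u << nr;
}

/* Each "handler" re-raises itself until it has run three times. */
static void demo_action(unsigned int nr)
{
    if (++demo_count[nr] < 3)
        demo_raise(nr);
}

/* Returns the number of passes over the pending mask. */
static int demo_do_softirq(void)
{
    int max_restart = DEMO_MAX_RESTART, passes = 0;
    unsigned int pending = demo_pending;
    int bit;

restart:
    demo_pending = 0;          /* new raises land in a fresh mask */
    passes++;
    while ((bit = demo_ffs(pending))) {
        demo_action((unsigned int)bit - 1);
        pending &= ~(1u << (bit - 1));
    }
    pending = demo_pending;    /* did a handler raise something new? */
    if (pending && --max_restart)
        goto restart;
    return passes;
}
```

A handler that keeps re-raising itself is therefore served for a bounded number of passes only; in the kernel, whatever is still pending after the limit is left to the ksoftirqd thread.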
Checks for the existence of deferred interrupts are performed periodically, and there are some points where this check occurs. The main point is the call of the do_IRQ function, which is defined in arch/x86/kernel/irq.c and provides the main means for actual interrupt processing in the Linux kernel. When this function finishes handling an interrupt, it calls the exiting_irq function from arch/x86/include/asm/apic.h, which expands to the call of the irq_exit function. irq_exit checks for deferred interrupts and the current context and calls the invoke_softirq function:
if (!in_interrupt() && local_softirq_pending())
    invoke_softirq();
which executes __do_softirq too. So what do we have in summary? Each softirq goes through the following stages:
* Registration of a softirq with the open_softirq function.
* Activation of a softirq by marking it as deferred with the raise_softirq function.
* After this, all marked softirqs will be run the next time the Linux kernel schedules a round of executions of deferrable functions.
* Execution of the deferred functions that have the same type.

As I already wrote, softirqs are statically allocated, and that is a problem for a kernel module that can be loaded. The second concept, which is built on top of softirqs, the tasklets, solves this problem.
Tasklets
If you read the source code of the Linux kernel that is related to softirqs, you will notice that they are used very rarely. The preferable way to implement deferrable functions is tasklets. As I already wrote above, tasklets are built on top of the softirq concept, and generally on top of two softirqs:
* TASKLET_SOFTIRQ;
* HI_SOFTIRQ.

In short, tasklets are softirqs that can be allocated and initialized at runtime, and unlike softirqs, tasklets of the same type cannot run on multiple processors at a time. OK, now we know a little bit about softirqs. Of course the previous text does not cover all aspects of this, but now we can look directly at the code and learn more about softirqs step by step in practice, and learn about tasklets. Let's return to the implementation of the softirq_init function that we talked about at the beginning of this part. This function is defined in the kernel/softirq.c source code file; let's look at its implementation:
void __init softirq_init(void)
{
    int cpu;

    for_each_possible_cpu(cpu) {
        per_cpu(tasklet_vec, cpu).tail =
            &per_cpu(tasklet_vec, cpu).head;
        per_cpu(tasklet_hi_vec, cpu).tail =
            &per_cpu(tasklet_hi_vec, cpu).head;
    }

    open_softirq(TASKLET_SOFTIRQ, tasklet_action);
    open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}
We can see the definition of the integer cpu variable at the beginning of the softirq_init function. Next we use it as a parameter for the for_each_possible_cpu macro, which goes through all possible processors in the system. If possible processor is new terminology for you, you can read more about it in the CPU masks chapter. In short, possible cpus is the set of processors that can be plugged in at any time during the life of that system boot. All possible processors are stored in the cpu_possible_bits bitmap; you can find its definition in kernel/cpu.c:
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
...
...
...
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
OK, we defined the integer cpu variable, go through all possible processors with the for_each_possible_cpu macro and initialize the two following per-cpu variables:
* tasklet_vec;
* tasklet_hi_vec;

These two per-cpu variables are defined in the same source code file as the softirq_init function and represent two tasklet_head structures:
static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
Where the tasklet_head structure represents a list of tasklets and contains two fields, head and tail:
struct tasklet_head {
    struct tasklet_struct *head;
    struct tasklet_struct **tail;
};
The tasklet_struct structure is defined in include/linux/interrupt.h and represents the tasklet. Previously we did not see this word in this book. Let's try to understand what the tasklet is. Actually, the tasklet is one of the mechanisms for handling deferred interrupts. Let's look at the implementation of the tasklet_struct structure:
struct tasklet_struct
{
    struct tasklet_struct *next;
    unsigned long state;
    atomic_t count;
    void (*func)(unsigned long);
    unsigned long data;
};
As we can see, this structure contains five fields. They are:
* Next tasklet in the scheduling queue;
* State of the tasklet;
* A counter that represents whether the tasklet is currently active or not;
* Main callback of the tasklet;
* Parameter of the callback.

In our case, we initialize only two arrays of tasklets in the softirq_init function: tasklet_vec and tasklet_hi_vec. Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. So, we have initialized these arrays and now we can see two calls of the open_softirq function, which is defined in the kernel/softirq.c source code file:
open_softirq(TASKLET_SOFTIRQ, tasklet_action);
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
at the end of the softirq_init function. The main purpose of the open_softirq function is the initialization of a softirq. We already saw its implementation earlier in this part: it stores the given softirq function in the softirq_vec array. In our case the functions are tasklet_action and tasklet_hi_action: the softirq function associated with the HI_SOFTIRQ softirq is named tasklet_hi_action, and the softirq function associated with TASKLET_SOFTIRQ is named tasklet_action. The Linux kernel provides an API for the manipulation of tasklets. First of all there is the tasklet_init function, which takes a tasklet_struct, a function and a parameter for it and initializes the given tasklet_struct with the given data:
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    atomic_set(&t->count, 0);
    t->func = func;
    t->data = data;
}
There are additional methods to initialize a tasklet statically, with the two following macros:
DECLARE_TASKLET(name, func, data);
DECLARE_TASKLET_DISABLED(name, func, data);
The Linux kernel provides the three following functions to mark a tasklet as ready to run:
void tasklet_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule(struct tasklet_struct *t);
void tasklet_hi_schedule_first(struct tasklet_struct *t);
The first function schedules a tasklet with normal priority, the second with high priority and the third out of turn. The implementation of all three functions is similar, so we will consider only the first, tasklet_schedule. Let's look at its implementation:
static inline void tasklet_schedule(struct tasklet_struct *t)
{
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
    unsigned long flags;

    local_irq_save(flags);
    t->next = NULL;
    *__this_cpu_read(tasklet_vec.tail) = t;
    __this_cpu_write(tasklet_vec.tail, &(t->next));
    raise_softirq_irqoff(TASKLET_SOFTIRQ);
    local_irq_restore(flags);
}
As we can see, it tests and sets the TASKLET_STATE_SCHED bit in the state of the given tasklet and, if the bit was not set before, executes __tasklet_schedule with the given tasklet. __tasklet_schedule looks very similar to the raise_softirq function that we saw above. It saves the interrupt flag and disables interrupts at the beginning. After this, it appends the new tasklet to tasklet_vec and calls the raise_softirq_irqoff function that we saw above. When the Linux kernel decides to run deferred functions, the tasklet_action function will be called for deferred functions associated with TASKLET_SOFTIRQ and tasklet_hi_action for deferred functions associated with HI_SOFTIRQ. These functions are very similar, and there is only one difference between them: tasklet_action uses tasklet_vec and tasklet_hi_action uses tasklet_hi_vec.
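The head/tail pair used by tasklet_vec is a singly linked list with a pointer to the last next slot, which makes appending a constant-time operation. Here is a minimal userspace sketch of that technique; the demo_ names and the id field are hypothetical, and the interrupt disabling and per-cpu access of the real code are left out.

```c
#include <assert.h>
#include <stddef.h>

struct demo_tasklet {
    struct demo_tasklet *next;
    int id;
};

struct demo_tasklet_head {
    struct demo_tasklet *head;
    struct demo_tasklet **tail;   /* address of the pointer to fill next */
};

/* Same setup as in softirq_init(): an empty list has tail == &head. */
static void demo_list_init(struct demo_tasklet_head *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

/* The same two pointer updates as in __tasklet_schedule(). */
static void demo_list_append(struct demo_tasklet_head *l, struct demo_tasklet *t)
{
    assert(l->tail != NULL);   /* demo_list_init() must have run */
    t->next = NULL;
    *l->tail = t;              /* link after the last element (or set head) */
    l->tail = &t->next;        /* the new element's next is now the tail slot */
}
```

Because tail always points at the slot that must be written next, the append code needs no special case for the empty list: the first append writes head itself.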
Let's look at the implementation of the tasklet_action function:
static void tasklet_action(struct softirq_action *a)
{
    struct tasklet_struct *list;

    local_irq_disable();
    list = __this_cpu_read(tasklet_vec.head);
    __this_cpu_write(tasklet_vec.head, NULL);
    __this_cpu_write(tasklet_vec.tail, this_cpu_ptr(&tasklet_vec.head));
    local_irq_enable();

    while (list) {
        struct tasklet_struct *t = list;

        list = list->next;

        if (tasklet_trylock(t)) {
            t->func(t->data);
            tasklet_unlock(t);
        }
        ...
        ...
        ...
    }
}
At the beginning of the tasklet_action function, we disable interrupts for the local processor with the help of the local_irq_disable macro (you can read about this macro in the second part of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and set this per-cpu list to NULL, because all of its tasklets are going to be executed now. After this we enable interrupts for the local processor and go through the list of tasklets in a loop. In every iteration of the loop we call the tasklet_trylock function for the given tasklet, which sets the TASKLET_STATE_RUN bit in the state of the given tasklet:
static inline int tasklet_trylock(struct tasklet_struct *t)
{
    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
If this operation was successful, we execute the tasklet's action (it was set in tasklet_init) and call the tasklet_unlock function, which clears the tasklet's TASKLET_STATE_RUN state.
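The trylock idea, atomically setting a bit and succeeding only if it was clear before, can be sketched with C11 atomics. This is an illustration with hypothetical demo_ names and is not the kernel implementation of test_and_set_bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { DEMO_STATE_SCHED, DEMO_STATE_RUN };   /* bit numbers */

struct demo_tasklet {
    atomic_ulong state;
};

/* Atomically sets bit 'nr' and reports whether it was already set. */
static bool demo_test_and_set_bit(int nr, atomic_ulong *addr)
{
    unsigned long mask;

    assert(nr >= 0 && nr <= DEMO_STATE_RUN);
    mask = 1UL << nr;
    return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static bool demo_tasklet_trylock(struct demo_tasklet *t)
{
    return !demo_test_and_set_bit(DEMO_STATE_RUN, &t->state);
}

static void demo_tasklet_unlock(struct demo_tasklet *t)
{
    atomic_fetch_and(&t->state, ~(1UL << DEMO_STATE_RUN));
}
```

A second trylock on the same tasklet fails until the first holder calls unlock, which is how the RUN bit keeps one tasklet from running on two processors at the same time.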
In general, that's all about the tasklets concept. Of course this does not cover tasklets in full, but I think that it is a good point from which you can continue to learn this concept.
Tasklets are a widely used concept in the Linux kernel, but as I wrote at the beginning of this part, there is a third mechanism for deferred functions: the workqueue. In the next paragraph we will see what it is.
Workqueues
The workqueue is another concept for handling deferred functions. It is similar to tasklets, with some differences. Workqueue functions run in the context of a kernel process, but tasklet functions run in the software interrupt context. This means that workqueue functions need not be atomic, as tasklet functions must be. Tasklets always run on the processor from which they were originally submitted. Workqueues work in the same way, but only by default. The workqueue concept is represented by the:
struct worker_pool {
    spinlock_t              lock;
    int                     cpu;
    int                     node;
    int                     id;
    unsigned int            flags;

    struct list_head        worklist;
    int                     nr_workers;
...
...
...
structure, which is defined in the kernel/workqueue.c source code file in the Linux kernel. I will not write the whole source code of this structure here, because it has quite a lot of fields, but we will consider some of those fields.
In its most basic form, the workqueue subsystem is an interface for creating kernel threads to handle work that is queued from elsewhere. All of these kernel threads are called worker threads. The work queue is maintained by the work_struct structure, which is defined in include/linux/workqueue.h. Let's look at this structure:
struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};
Here are two things that we are interested in: func, the function that will be scheduled by the workqueue, and data, the parameter of this function. The Linux kernel provides special per-cpu threads that are called kworker:
$ systemd-cgls -k | grep kworker
├─   5 [kworker/0:0H]
├─  15 [kworker/1:0H]
├─  20 [kworker/2:0H]
├─  25 [kworker/3:0H]
├─  30 [kworker/4:0H]
...
...
...
These threads can be used to schedule the deferred functions of the workqueues (as ksoftirqd does for softirqs). Besides this, we can create a new separate worker thread for a workqueue. The Linux kernel provides the following macros for the creation of a work item:
#define DECLARE_WORK(n, f) \
    struct work_struct n = __WORK_INITIALIZER(n, f)
for static creation. It takes two parameters: the name of the work and the work function. For the creation of a work at runtime, we can use the:
#define INIT_WORK(_work, _func)       \
    __INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)                     \
    do {                                                        \
        __init_work((_work), _onstack);                         \
        (_work)->data = (atomic_long_t) WORK_DATA_INIT();       \
        INIT_LIST_HEAD(&(_work)->entry);                        \
        (_work)->func = (_func);                                \
    } while (0)
macro, which takes the work_struct structure that has to be initialized and the function to be scheduled in this workqueue. After a work has been created with one of these macros, we need to put it into a workqueue. We can do it with the help of the queue_work or queue_delayed_work functions:
static inline bool queue_work(struct workqueue_struct *wq,
                              struct work_struct *work)
{
    return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
The queue_work function just calls the queue_work_on function, which queues work on a specific processor. Note that in our case we pass WORK_CPU_UNBOUND to the queue_work_on function. It is a part of the enum that is defined in include/linux/workqueue.h and represents work which is not bound to any specific processor. The queue_work_on function tests and sets the WORK_STRUCT_PENDING_BIT bit of the given work and executes the __queue_work function with the workqueue for the given processor and the given work:
__queue_work(cpu, wq, work);
The __queue_work function gets the work pool. Yes, the work pool, not the workqueue. Actually, works are not placed in the workqueue itself, but in the work pool, which is represented by the worker_pool structure in the Linux kernel. The workqueue_struct structure has the pwqs field, which is a list of pool_workqueue structures. When we create a workqueue, a pool_workqueue is allocated for each processor. Each pool_workqueue is associated with a worker_pool, which is allocated on the same processor and corresponds to the type of priority queue. Through them the workqueue interacts with the worker_pool. So in the __queue_work function we set the cpu to the current processor with raw_smp_processor_id (you can find information about this macro in the fourth part of the Linux kernel initialization process chapter), get the pool_workqueue for the given workqueue_struct and insert the given work into the given workqueue:
staticvoid__queue_work(intcpu,structworkqueue_struct*wq,
structwork_struct*work)
{
...
...
...
if(req_cpu==WORK_CPU_UNBOUND)
cpu=raw_smp_processor_id();
if(!(wq->flags&WQ_UNBOUND))
pwq=per_cpu_ptr(wq->cpu_pwqs,cpu);
else
pwq=unbound_pwq_by_node(wq,cpu_to_node(cpu));
...
...
...
insert_work(pwq,work,worklist,work_flags);
Now that we can create works and workqueues, we need to know when they are executed. As I already wrote, all works are executed by a kernel thread. When this kernel thread is scheduled, it starts to execute works from the given workqueue. Each worker thread executes a loop inside the worker_thread function. This thread does many different things, and some of them are similar to what we saw earlier in this part. As it starts executing, it removes all work_struct items, or works, from its workqueue.
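To make the path from INIT_WORK through queue_work to the worker loop a bit more concrete, here is a minimal single-threaded user-space sketch. It is not kernel code: all of the demo_ names, the fixed-size ring that stands in for a worker pool and the boolean pending flag are assumptions made for illustration. A real worker_pool is protected by locks and is served by kernel threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for work_struct and a worker pool. */
struct demo_work;
typedef void (*demo_work_func_t)(struct demo_work *work);

struct demo_work {
    bool pending;            /* plays the role of WORK_STRUCT_PENDING_BIT */
    demo_work_func_t func;   /* deferred function to run later */
    int arg;                 /* illustrative payload */
    int result;
};

#define DEMO_POOL_SIZE 8

struct demo_pool {
    struct demo_work *items[DEMO_POOL_SIZE];
    size_t head, count;      /* simple FIFO ring */
};

/* Counterpart of INIT_WORK: bind a function to a work item. */
static void demo_init_work(struct demo_work *work, demo_work_func_t func, int arg)
{
    work->pending = false;
    work->func = func;
    work->arg = arg;
    work->result = 0;
}

/* Counterpart of queue_work: returns false if the work is already pending. */
static bool demo_queue_work(struct demo_pool *pool, struct demo_work *work)
{
    if (work->pending || pool->count == DEMO_POOL_SIZE)
        return false;
    work->pending = true;
    pool->items[(pool->head + pool->count) % DEMO_POOL_SIZE] = work;
    pool->count++;
    return true;
}

/* Counterpart of the worker loop: drain the pool, return items processed. */
static int demo_worker_run(struct demo_pool *pool)
{
    int done = 0;

    while (pool->count > 0) {
        struct demo_work *work = pool->items[pool->head];

        pool->head = (pool->head + 1) % DEMO_POOL_SIZE;
        pool->count--;
        work->pending = false;   /* cleared before the function runs */
        assert(work->func != NULL);
        work->func(work);
        done++;
    }
    return done;
}

/* Example deferred function: squares its payload. */
static void demo_square(struct demo_work *work)
{
    work->result = work->arg * work->arg;
}
```

Queueing the same item twice before the worker has run is rejected, which is the effect the pending bit has in queue_work_on.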
That's all.

Conclusion

It is the end of the ninth part of the Interrupts and Interrupt Handling chapter, and we continued to dive into external hardware interrupts in this part. In the previous part we saw the initialization of the IRQs and the main irq_desc structure. In this part we saw three concepts that are used for deferred functions: the softirq, the tasklet and the workqueue.

The next part will be the last part of the Interrupts and Interrupt Handling chapter, and we will look at a real hardware driver and try to learn how it works with the interrupts subsystem.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

initcall
IF
eflags
CPU masks
per-cpu
Workqueue
Previous part
Interrupts and Interrupt Handling. Part 10.

Last part

This is the tenth part of the chapter about interrupts and interrupt handling in the Linux kernel, and in the previous part we saw a little about deferred interrupts and related concepts like softirq, tasklet and workqueue. In this part we will continue to dive into this theme, and now it's time to look at a real hardware driver.

Let's consider the serial driver of the StrongARM SA-110/21285 Evaluation Board as an example and look at how this driver requests an IRQ line, what happens when an interrupt is triggered, etc. The source code of this driver is located in the drivers/tty/serial/21285.c source code file. Ok, we have the source code, let's start.

Initialization of a kernel module

We will consider this driver in the same way as we did with all new concepts that we saw in this book: we will start from its initialization. As you may already know, the Linux kernel provides two macros for the initialization and finalization of a driver or a kernel module:

module_init;
module_exit.

And we can find the usage of these macros in our driver source code:

module_init(serial21285_init);
module_exit(serial21285_exit);

Most device drivers can be compiled as a loadable kernel module, or they can be statically linked into the Linux kernel. In the first case, the module_init and module_exit macros that are defined in include/linux/init.h make the given functions aliases of init_module and cleanup_module:

#define module_init(initfn) \
    static inline initcall_t __inittest(void) \
    { return initfn; } \
    int init_module(void) __attribute__((alias(#initfn)));

#define module_exit(exitfn) \
    static inline exitcall_t __exittest(void) \
    { return exitfn; } \
    void cleanup_module(void) __attribute__((alias(#exitfn)));

Otherwise, if a device driver is statically linked into the Linux kernel, the implementation of these macros is the following:

#define module_init(x) __initcall(x);
#define module_exit(x) __exitcall(x);

In this case the initialization function is registered as an initcall (__initcall is the same as device_initcall), which is one of the following initcall levels:

early_initcall
pure_initcall
core_initcall
postcore_initcall
arch_initcall
subsys_initcall
fs_initcall
rootfs_initcall
device_initcall
late_initcall

that are called in do_initcalls from init/main.c.
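The alias attribute used in the module_init definition shown earlier can be tried out in user space. The following is only a sketch under a few assumptions: it needs GCC or Clang on an ELF target such as Linux, and the names demo_driver_init, demo_init_module and demo_module_init are made up for this example (the kernel uses init_module as the alias name).

```c
#include <assert.h>

/* Hypothetical driver init function; the name is made up for this sketch. */
int demo_driver_init(void)
{
    return 42;
}

/*
 * Same trick as the module_init variant that uses alias():
 * demo_init_module becomes a second name for the function passed in.
 */
#define demo_module_init(initfn) \
    int demo_init_module(void) __attribute__((alias(#initfn)))

demo_module_init(demo_driver_init);

/* Call through the alias, the way a loader that only knows one name would. */
int demo_call_init(void)
{
    int ret = demo_init_module();

    assert(ret == demo_driver_init());   /* both names run the same code */
    return ret;
}
```

The loader only has to know one fixed symbol name, while each driver is free to name its own initialization function.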
For a loadable module, loading is implemented in the kernel/module.c source code file, and initialization occurs in the do_init_module function. We will not dive into details about loadable modules in this chapter, but will see them in a special chapter that will describe Linux kernel modules. Ok, the module_init macro takes one parameter - serial21285_init in our case. As we can understand from the function's name, this function does stuff related to driver initialization. Let's look at it:

static int __init serial21285_init(void)
{
    int ret;

    printk(KERN_INFO "Serial: 21285 driver\n");

    serial21285_setup_ports();

    ret = uart_register_driver(&serial21285_reg);
    if (ret == 0)
        uart_add_one_port(&serial21285_reg, &serial21285_port);

    return ret;
}
As we can see, first of all it prints information about the driver to the kernel buffer and calls the serial21285_setup_ports function. This function sets up the base uart clock of the serial21285_port device:

unsigned int mem_fclk_21285 = 50000000;

static void serial21285_setup_ports(void)
{
    serial21285_port.uartclk = mem_fclk_21285 / 4;
}
Here serial21285_reg is the structure that describes the uart driver:

static struct uart_driver serial21285_reg = {
    .owner       = THIS_MODULE,
    .driver_name = "ttyFB",
    .dev_name    = "ttyFB",
    .major       = SERIAL_21285_MAJOR,
    .minor       = SERIAL_21285_MINOR,
    .nr          = 1,
    .cons        = SERIAL_21285_CONSOLE,
};

If the driver is registered successfully, we attach the driver-defined port structure serial21285_port with the uart_add_one_port function from the drivers/tty/serial/serial_core.c source code file and return from the serial21285_init function:

if (ret == 0)
    uart_add_one_port(&serial21285_reg, &serial21285_port);

return ret;
That's all. Our driver is initialized. When a uart port is opened with a call to the uart_open function from drivers/tty/serial/serial_core.c, it will call the uart_startup function to start up the serial port. This function will call the startup function that is part of the uart_ops structure. Each uart driver has a definition of this structure; in our case it is the serial21285_ops structure:

static struct uart_ops serial21285_ops = {
    ...
    .startup = serial21285_startup,
    ...
};

As we can see, the .startup field references the serial21285_startup function. The implementation of this function is very interesting for us, because it is related to interrupts and interrupt handling.
Requesting irq line

Let's look at the implementation of the serial21285_startup function:

static int serial21285_startup(struct uart_port *port)
{
    int ret;

    tx_enabled(port) = 1;
    rx_enabled(port) = 1;

    ret = request_irq(IRQ_CONRX, serial21285_rx_chars, 0,
                      serial21285_name, port);
    if (ret == 0) {
        ret = request_irq(IRQ_CONTX, serial21285_tx_chars, 0,
                          serial21285_name, port);
        if (ret)
            free_irq(IRQ_CONRX, port);
    }

    return ret;
}

First of all, about TX and RX. A serial bus of a device consists of just two wires: one for sending data and another for receiving. As such, serial devices should have two serial pins: the receiver - RX, and the transmitter - TX. With the first two macros, tx_enabled and rx_enabled, we enable these wires. The following part of this function is of the greatest interest to us. Note the calls of the request_irq function. This function registers an interrupt handler and enables a given interrupt line. Let's look at the implementation of this function and get into the details. This function is defined in the include/linux/interrupt.h header file and looks like this:

static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
            const char *name, void *dev)
{
    return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}

As we can see, the request_irq function takes five parameters:

irq - the interrupt number that is being requested;
handler - the pointer to the interrupt handler;
flags - the bitmask options;
name - the name of the owner of an interrupt;
dev - the pointer used for shared interrupt lines;
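The error handling in serial21285_startup follows a common pattern: if the second request fails, the first one is released again. A small user-space sketch of this pattern is shown below. The demo_request_irq and demo_free_irq functions and the demo_lines table are hypothetical stand-ins invented for this example; they only model non-shared lines and are not the kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_IRQS 32

typedef int (*demo_handler_t)(int irq, void *dev);

/* One slot per line; a non-NULL handler means the line is taken. */
static struct {
    demo_handler_t handler;
    void *dev;
} demo_lines[DEMO_NR_IRQS];

/* Hypothetical stand-in for request_irq(): non-shared lines only. */
static int demo_request_irq(unsigned int irq, demo_handler_t handler, void *dev)
{
    if (irq >= DEMO_NR_IRQS || handler == NULL)
        return -EINVAL;
    if (demo_lines[irq].handler != NULL)
        return -EBUSY;
    demo_lines[irq].handler = handler;
    demo_lines[irq].dev = dev;
    return 0;
}

/* Hypothetical stand-in for free_irq(): dev must match the registration. */
static void demo_free_irq(unsigned int irq, void *dev)
{
    assert(irq < DEMO_NR_IRQS);
    if (demo_lines[irq].dev == dev) {
        demo_lines[irq].handler = NULL;
        demo_lines[irq].dev = NULL;
    }
}

static int demo_rx(int irq, void *dev) { (void)irq; (void)dev; return 1; }
static int demo_tx(int irq, void *dev) { (void)irq; (void)dev; return 1; }

/* Same shape as serial21285_startup(): take RX, then TX, undo RX on failure. */
static int demo_startup(unsigned int rx_irq, unsigned int tx_irq, void *port)
{
    int ret = demo_request_irq(rx_irq, demo_rx, port);

    if (ret == 0) {
        ret = demo_request_irq(tx_irq, demo_tx, port);
        if (ret)
            demo_free_irq(rx_irq, port);
    }
    return ret;
}
```

After a failed demo_startup call no line stays half-registered, so the caller can simply report the error.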
Now let's look at the calls of the request_irq function in our example. As we can see, the first parameter is IRQ_CONRX. We know that it is the number of the interrupt, but what is CONRX? This macro is defined in the arch/arm/mach-footbridge/include/mach/irqs.h header file. There we can find the full list of interrupts that the 21285 board can generate. Note that in the second call of the request_irq function we pass the IRQ_CONTX interrupt number. These two interrupts handle the RX and TX events in our driver. The implementation of these macros is easy:

#define IRQ_CONRX        _DC21285_IRQ(0)
#define IRQ_CONTX        _DC21285_IRQ(1)
...
...
...
#define _DC21285_IRQ(x)  (16 + (x))
The ISA IRQs on this board are from 0 to 15, so our interrupts will have the first two numbers after them: 16 and 17. The second parameters for the two calls of the request_irq function are serial21285_rx_chars and serial21285_tx_chars. These functions will be called when an RX or TX interrupt occurs. We will not dive into the details of these functions in this part, because this chapter covers interrupts and interrupt handling, not devices and drivers. The next parameter is flags and, as we can see, it is zero in both calls of the request_irq function. All acceptable flags are defined as IRQF_* macros in include/linux/interrupt.h. Some of them:

IRQF_SHARED - allows sharing the irq among several devices;
IRQF_PERCPU - an interrupt is per cpu;
IRQF_NO_THREAD - an interrupt cannot be threaded;
IRQF_NOBALANCING - excludes this interrupt from irq balancing;
IRQF_IRQPOLL - an interrupt is used for polling;
etc.

In our case we pass 0, so it will be IRQF_TRIGGER_NONE. This flag means that it does not imply any kind of edge or level triggered interrupt behaviour. To the fourth parameter (name), we pass serial21285_name, which is defined as:

static const char serial21285_name[] = "Footbridge UART";

and will be displayed in the output of /proc/interrupts. And in the last parameter we pass the pointer to our main uart_port structure. Now that we know a little about the request_irq function and its parameters, let's look at its implementation. As we can see above, the request_irq function just calls the request_threaded_irq function. The request_threaded_irq function is defined in the kernel/irq/manage.c source code file and allocates a given interrupt line. If we look at this function, it starts with the definitions of the irqaction and the irq_desc:
int request_threaded_irq(unsigned int irq, irq_handler_t handler,
                         irq_handler_t thread_fn, unsigned long irqflags,
                         const char *devname, void *dev_id)
{
    struct irqaction *action;
    struct irq_desc *desc;
    int retval;
    ...
    ...
    ...
}
We already saw the irqaction and the irq_desc structures in this chapter. The first structure represents a per-interrupt action descriptor and contains pointers to the interrupt handler, the name of the device, the interrupt number, etc. The second structure represents a descriptor of an interrupt and contains a pointer to the irqaction, interrupt flags, etc. Note that the request_threaded_irq function is called by request_irq with an additional parameter: irq_handler_t thread_fn. If this parameter is not NULL, an irq thread will be created and thread_fn will be executed in this thread. In the next step we need to make the following checks:

if (((irqflags & IRQF_SHARED) && !dev_id) ||
    (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
    ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
        return -EINVAL;

First of all we check that a real dev_id is passed for a shared interrupt, that IRQF_COND_SUSPEND is used only with shared interrupts (it only makes sense for them), and that IRQF_NO_SUSPEND and IRQF_COND_SUSPEND are not combined. Otherwise we exit from this function with the -EINVAL error. After this we convert the given irq number to the irq descriptor with the help of the irq_to_desc function, which is defined in the kernel/irq/irqdesc.c source code file, and exit from this function with the -EINVAL error if it was not successful:

desc = irq_to_desc(irq);
if (!desc)
    return -EINVAL;
The irq_to_desc function checks that the given irq number is less than the maximum number of IRQs and returns the irq descriptor that is located at the irq offset in the irq_desc array:

struct irq_desc *irq_to_desc(unsigned int irq)
{
    return (irq < NR_IRQS) ? irq_desc + irq : NULL;
}
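The bounds-checked lookup can be reproduced in a few lines of user-space C. This sketch models only the flat-array variant shown above; DEMO_NR_IRQS, the one-field demo_irq_desc structure and the demo_lookup helper are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_IRQS 16   /* illustrative; the real NR_IRQS is arch-specific */

struct demo_irq_desc {
    unsigned int irq;     /* the real irq_desc has many more fields */
};

static struct demo_irq_desc demo_irq_desc[DEMO_NR_IRQS];

/* Fill in each descriptor with its own number. */
static void demo_init_descs(void)
{
    for (unsigned int i = 0; i < DEMO_NR_IRQS; i++)
        demo_irq_desc[i].irq = i;
}

/* Bounds-checked lookup: pointer arithmetic on the array, or NULL. */
static struct demo_irq_desc *demo_irq_to_desc(unsigned int irq)
{
    return (irq < DEMO_NR_IRQS) ? demo_irq_desc + irq : NULL;
}

/* Returns the descriptor's number, or -1 when the lookup fails. */
static int demo_lookup(unsigned int irq)
{
    struct demo_irq_desc *desc = demo_irq_to_desc(irq);

    if (!desc)
        return -1;   /* the kernel caller returns -EINVAL here */
    assert(desc == &demo_irq_desc[irq]);
    return (int)desc->irq;
}
```

Because irq is unsigned, a single comparison against the array size is enough to reject every out-of-range value.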
Now that we have converted the irq number to the irq descriptor, we check the status of the descriptor to make sure that the interrupt can be requested:

if (!irq_settings_can_request(desc) ||
    WARN_ON(irq_settings_is_per_cpu_devid(desc)))
        return -EINVAL;

and exit with -EINVAL otherwise. After this we check the given interrupt handler. If it was not passed to the request_irq function, we check thread_fn. If both handlers are NULL, we return -EINVAL. If an interrupt handler was not passed to the request_irq function, but thread_fn is not NULL, we set the handler to irq_default_primary_handler:

if (!handler) {
    if (!thread_fn)
        return -EINVAL;
    handler = irq_default_primary_handler;
}

In the next step we allocate memory for our irqaction with the kzalloc function and return from the function if this operation was not successful:

action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
if (!action)
    return -ENOMEM;
More about kzalloc will be in a separate chapter about memory management in the Linux kernel. Once we have allocated space for the irqaction, we start to initialize this structure with the values of the interrupt handler, interrupt flags, device name, etc:

action->handler = handler;
action->thread_fn = thread_fn;
action->flags = irqflags;
action->name = devname;
action->dev_id = dev_id;

At the end of the request_threaded_irq function we call the __setup_irq function from kernel/irq/manage.c, which registers the given irqaction. If registration fails, we release the memory of the irqaction, and then we return:

chip_bus_lock(desc);
retval = __setup_irq(irq, desc, action);
chip_bus_sync_unlock(desc);

if (retval)
    kfree(action);

return retval;
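The order of the checks walked through above can be summarized in a user-space sketch. Everything here is an assumption made for illustration: the DEMO_IRQF_ values are not the kernel's flag values, calloc stands in for kzalloc, and the descriptor lookup and __setup_irq steps are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative flag values; the real ones live in include/linux/interrupt.h. */
#define DEMO_IRQF_SHARED        0x01u
#define DEMO_IRQF_NO_SUSPEND    0x02u
#define DEMO_IRQF_COND_SUSPEND  0x04u

typedef int (*demo_handler_t)(int irq, void *dev_id);

struct demo_irqaction {
    demo_handler_t handler;
    demo_handler_t thread_fn;
    unsigned long flags;
    const char *name;
    void *dev_id;
};

/* Stand-in for irq_default_primary_handler. */
static int demo_default_primary(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    return 0;
}

/*
 * Mirrors the order of checks described in the text. On success the new
 * action is stored in *out and owned by the caller (free() it when done).
 */
static int demo_request_threaded_irq(demo_handler_t handler,
                                     demo_handler_t thread_fn,
                                     unsigned long flags, const char *name,
                                     void *dev_id, struct demo_irqaction **out)
{
    struct demo_irqaction *action;

    assert(out != NULL);

    if (((flags & DEMO_IRQF_SHARED) && !dev_id) ||
        (!(flags & DEMO_IRQF_SHARED) && (flags & DEMO_IRQF_COND_SUSPEND)) ||
        ((flags & DEMO_IRQF_NO_SUSPEND) && (flags & DEMO_IRQF_COND_SUSPEND)))
        return -EINVAL;

    if (!handler) {
        if (!thread_fn)
            return -EINVAL;
        handler = demo_default_primary;
    }

    action = calloc(1, sizeof(*action));   /* zeroed, like kzalloc */
    if (!action)
        return -ENOMEM;

    action->handler = handler;
    action->thread_fn = thread_fn;
    action->flags = flags;
    action->name = name;
    action->dev_id = dev_id;
    *out = action;
    return 0;
}
```

All cheap argument checks run before the allocation, so the early error paths have nothing to clean up.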
Note that the call of the __setup_irq function is placed between the chip_bus_lock and the chip_bus_sync_unlock functions. These functions lock/unlock access to slow bus (like i2c) chips. Now let's look at the implementation of the __setup_irq function. At the beginning of the __setup_irq function we can see a couple of different checks. First of all we check that the given interrupt descriptor is not NULL, that an irqchip is set for it, and that a reference to the module that owns the given interrupt descriptor can be taken. After this we check whether the interrupt is nested into another interrupt thread, and if it is nested we replace the irq_default_primary_handler with the irq_nested_primary_handler.

In the next step we create an irq handler thread with the kthread_create function, if the given interrupt is not nested and thread_fn is not NULL:

if (new->thread_fn && !nested) {
    struct task_struct *t;

    t = kthread_create(irq_thread, new, "irq/%d-%s", irq, new->name);
    ...
}

And we fill the rest of the given interrupt descriptor fields at the end. So, our 16 and 17 interrupt request lines are registered, and the serial21285_rx_chars and serial21285_tx_chars functions will be invoked when the interrupt controller gets an event related to these interrupts. Now let's look at what happens when an interrupt occurs.
Prepare to handle an interrupt

In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and the registration of the irqaction structure for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event and the processor tries to find the appropriate interrupt gate for this interrupt. If you have read the eighth part of this chapter, you may remember the native_init_IRQ function. This function initializes the local APIC. The following part of this function is the most interesting part for us right now:

for_each_clear_bit_from(i, used_vectors, first_system_vector) {
    set_intr_gate(i, irq_entries_start +
                  8 * (i - FIRST_EXTERNAL_VECTOR));
}

Here we iterate over all the cleared bits of the used_vectors bitmap, starting from i and bounded by first_system_vector, which is:

int first_system_vector = FIRST_SYSTEM_VECTOR; // 0xef

and set interrupt gates with the vector number i and the irq_entries_start + 8 * (i - FIRST_EXTERNAL_VECTOR) start address. Only one thing is unclear here - irq_entries_start. This symbol is defined in the arch/x86/entry/entry_64.S assembly file and provides the irq entries. Let's look at it:
    .align 8
ENTRY(irq_entries_start)
    vector=FIRST_EXTERNAL_VECTOR
    .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
    pushq   $(~vector+0x80)
    vector=vector+1
    jmp     common_interrupt
    .align  8
    .endr
END(irq_entries_start)

Here we can see the GNU assembler .rept directive, which repeats the sequence of lines before .endr FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR times. As we already know, FIRST_SYSTEM_VECTOR is 0xef and FIRST_EXTERNAL_VECTOR is equal to 0x20. So, it will repeat:

>>> 0xef - 0x20
207

times. In the body of the .rept directive we push the encoded vector number on the stack (note that we use negative numbers for the interrupt vector numbers, because positive numbers are already reserved to identify system calls), increment the vector variable and jump to the common_interrupt label. In common_interrupt we adjust the vector number on the stack and execute the interrupt macro with the do_IRQ parameter:

common_interrupt:
    addq    $-0x80, (%rsp)
    interrupt do_IRQ
The interrupt macro is defined in the same source code file. It saves general purpose registers on the stack, switches from the userspace gs to the kernel one with the SWAPGS assembler instruction if needed, increments the per-cpu irq_count variable, which shows that we are in an interrupt, and calls the do_IRQ function. This function is defined in the arch/x86/kernel/irq.c source code file and handles our device interrupt. Let's look at this function. The do_IRQ function takes one parameter - the pt_regs structure that stores the values of the userspace registers:

__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
    struct pt_regs *old_regs = set_irq_regs(regs);
    unsigned vector = ~regs->orig_ax;
    unsigned irq;

    irq_enter();
    exit_idle();
    ...
    ...
    ...
}
At the beginning of this function we can see the call of the set_irq_regs function, which returns the saved per-cpu irq register pointer, and the calls of the irq_enter and exit_idle functions. The first function, irq_enter, enters an interrupt context by updating the __preempt_count variable, and the second function, exit_idle, checks whether the current process is idle (pid 0) and notifies the idle_notifier with IDLE_END.
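The arithmetic behind pushq $(~vector+0x80), addq $-0x80 and ~regs->orig_ax is easy to check in user space. The sketch below uses hypothetical demo_ names and simply replays the three steps for every external vector. It also checks that each pushed immediate stays within the signed byte range, which is the purpose of the 0x80 offset as far as I understand the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FIRST_EXTERNAL_VECTOR 0x20
#define DEMO_FIRST_SYSTEM_VECTOR   0xef

/* Value pushed by each stub: pushq $(~vector+0x80). */
static int64_t demo_stub_push(unsigned int vector)
{
    return ~(int64_t)vector + 0x80;
}

/* common_interrupt: addq $-0x80,(%rsp) leaves ~vector on the stack. */
static int64_t demo_common_interrupt_adjust(int64_t pushed)
{
    return pushed - 0x80;
}

/* do_IRQ: unsigned vector = ~regs->orig_ax; */
static unsigned int demo_recover_vector(int64_t orig_ax)
{
    assert(orig_ax < 0);   /* adjusted values are always negative */
    return (unsigned int)~orig_ax;
}

/* Number of stubs emitted by the .rept block. */
static int demo_stub_count(void)
{
    return DEMO_FIRST_SYSTEM_VECTOR - DEMO_FIRST_EXTERNAL_VECTOR;
}

/* Returns 1 if every stub value fits in a signed byte and round-trips. */
static int demo_check_all_vectors(void)
{
    for (unsigned int v = DEMO_FIRST_EXTERNAL_VECTOR;
         v < DEMO_FIRST_SYSTEM_VECTOR; v++) {
        int64_t pushed = demo_stub_push(v);

        if (pushed < INT8_MIN || pushed > INT8_MAX)
            return 0;
        if (demo_recover_vector(demo_common_interrupt_adjust(pushed)) != v)
            return 0;
    }
    return 1;
}
```

For example, vector 0x20 is pushed as 95, becomes -33 after the adjustment in common_interrupt, and the bitwise negation in do_IRQ gives 0x20 back.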
In the next step we read the irq for the current cpu and call the handle_irq function:

irq = __this_cpu_read(vector_irq[vector]);

if (!handle_irq(irq, regs)) {
    ...
    ...
    ...
}
...
...
...

The handle_irq function is defined in the arch/x86/kernel/irq_64.c source code file. It checks the given interrupt descriptor and calls generic_handle_irq_desc:

desc = irq_to_desc(irq);
if (unlikely(!desc))
    return false;
generic_handle_irq_desc(irq, desc);

Where generic_handle_irq_desc calls the interrupt handler:

static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
    desc->handle_irq(irq, desc);
}
But stop... What is handle_irq, and why do we call our interrupt handler from the interrupt descriptor when we know that the irqaction points to the actual interrupt handler? Actually, irq_desc->handle_irq is a high-level API for calling the interrupt handler routine. It is set up during the initialization of the device tree and the APIC initialization. The kernel selects the correct function there, and this function calls the chain of the irq->action(s). In this way, the serial21285_tx_chars or the serial21285_rx_chars function will be executed after an interrupt occurs.
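The two levels of indirection, first desc->handle_irq and then the chain of actions, can be modelled in user space as follows. This is only a sketch: the demo_ structures keep just the fields needed here, and demo_handle_simple_irq is a hypothetical flow handler written for this example, not one of the kernel's flow handlers.

```c
#include <assert.h>
#include <stddef.h>

struct demo_irq_desc;

typedef int (*demo_handler_t)(int irq, void *dev_id);
typedef void (*demo_flow_handler_t)(unsigned int irq, struct demo_irq_desc *desc);

struct demo_irqaction {
    demo_handler_t handler;          /* the driver's handler */
    void *dev_id;
    struct demo_irqaction *next;     /* further actions on a shared line */
};

struct demo_irq_desc {
    demo_flow_handler_t handle_irq;  /* high-level flow handler */
    struct demo_irqaction *action;   /* chain of registered actions */
    int handled;                     /* illustrative counter */
};

/* Hypothetical flow handler: walk the action chain and call each handler. */
static void demo_handle_simple_irq(unsigned int irq, struct demo_irq_desc *desc)
{
    for (struct demo_irqaction *a = desc->action; a != NULL; a = a->next)
        desc->handled += a->handler((int)irq, a->dev_id);
}

/* Same shape as generic_handle_irq_desc(). */
static void demo_generic_handle_irq_desc(unsigned int irq, struct demo_irq_desc *desc)
{
    assert(desc != NULL && desc->handle_irq != NULL);
    desc->handle_irq(irq, desc);
}

/* Example driver handler: counts received characters in *dev_id. */
static int demo_rx_chars(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
    return 1;   /* "handled" */
}
```

The caller never touches the driver's handler directly; it only knows the descriptor, which is what allows different interrupt types to use different flow handlers.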
At the end of the do_IRQ function we call the irq_exit function, which exits from the interrupt context, call set_irq_regs with the old userspace registers and return:

irq_exit();
set_irq_regs(old_regs);
return 1;

We already know that when an IRQ finishes its work, deferred interrupts will be executed if they exist.
Exit from interrupt

Ok, the interrupt handler has finished its execution and now we must return from the interrupt. When the work of the do_IRQ function is finished, we return to the assembler code in arch/x86/entry/entry_64.S, to the ret_from_intr label. First of all we disable interrupts with the DISABLE_INTERRUPTS macro, which expands to the cli instruction, and decrement the value of the irq_count per-cpu variable. Remember, this variable was incremented by the interrupt macro when we entered the interrupt context:

DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl PER_CPU_VAR(irq_count)

In the last step we check the previous context (user or kernel), restore it in the correct way and exit from the interrupt with:

INTERRUPT_RETURN

where the INTERRUPT_RETURN macro is:

#define INTERRUPT_RETURN jmp native_iret

and

ENTRY(native_iret)

.global native_irq_return_iret
native_irq_return_iret:
    iretq
That's all.

Conclusion

It is the end of the tenth part of the Interrupts and Interrupt Handling chapter and, as you have read at the beginning of this part, it is the last part of this chapter. This chapter started with an explanation of the theory of interrupts, and we learned what an interrupt is and what kinds of interrupts exist. Then we saw exceptions and the handling of this kind of interrupt, deferred interrupts, and finally we looked at hardware interrupts and their handling in this part. Of course, this part and even this chapter do not cover all aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was a big part. I don't know about you, but it was really big for me. This theme is much bigger than this chapter, and I am not sure that there is a book somewhere that covers it. We have missed many parts and aspects of interrupts and interrupt handling, but I think this will be a good starting point to dive into the kernel code related to interrupts and interrupt handling.

If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

Serial driver documentation
StrongARM SA-110/21285 Evaluation Board
IRQ
module
initcall
uart
ISA
memory management
i2c
APIC
GNU assembler
Processor register
per-cpu
pid
device tree
system calls
Previous part
System calls

This chapter describes the system call concept in the linux kernel.

Introduction to system call concept - this part is an introduction to the system call concept in the Linux kernel.
How the Linux kernel handles a system call - this part describes how the Linux kernel handles a system call from a userspace application.
vsyscall and vDSO - the third part describes the vsyscall and vDSO concepts.
How the Linux kernel runs a program - this part describes the startup process of a program.
System calls in the Linux kernel. Part 1.

Introduction

This post opens up a new chapter in the linux-insides book, and as you may understand from the title, this chapter will be devoted to the System call concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous chapter we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what happens when a system call occurs from userspace, and we will see the implementation of a couple of system call handlers in the Linux kernel, the VDSO and vsyscall concepts and many many more.

Before we start to dive into the implementation of the system calls related stuff in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.

System call. What is it?

A system call is just a userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start to listen for connections on a socket, delete or create a directory, or even to finish its work, it uses a system call. In other words, a system call is just a C function that is placed in the kernel space, and a user program can ask the kernel to do something via this function.

The Linux kernel provides a set of these functions, and each architecture provides its own set. For example: x86_64 provides 322 system calls and x86 provides 358 different system calls. Ok, a system call is just a function. Let's look at a simple Hello world example that's written in the assembly programming language:

.data

msg:
    .ascii "Hello, world!\n"
    len = . - msg

.text
    .global _start

_start:
    movq  $1, %rax
    movq  $1, %rdi
    movq  $msg, %rsi
    movq  $len, %rdx
    syscall

    movq  $60, %rax
    xorq  %rdi, %rdi
    syscall

We can compile the above with the following commands:

$ gcc -c test.S
$ ld -o test test.o

and run it as follows:

./test
Hello, world!
Ok, what do we see here? This simple code represents a Hello world assembly program for the Linux x86_64 architecture. We can see two sections here:

.data
.text

The first section, .data, stores the initialized data of our program (the Hello world string and its length in our case). The second section, .text, contains the code of our program. We can split the code of our program into two parts: the first part is before the first syscall instruction and the second part is between the first and second syscall instructions. First of all, what does the syscall instruction do in our code and generally? As we can read in the 64-ia-32-architectures-software-developer-vol-2b-manual:
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by
loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction
following SYSCALL into RCX). (The WRMSR instruction ensures that the
IA32_LSTAR MSR always contain a canonical address.)
...
...
...
SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the
IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the
descriptors (in GDT or LDT) referenced by those selectors.

Instead, the descriptor caches are loaded with fixed values. It is the
responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced
by those selector values correspond to the fixed values loaded into the descriptor
caches; the SYSCALL instruction does not ensure this correspondence.
and we initialize the system call entry by writing the address of entry_SYSCALL_64, which is defined in the arch/x86/entry/entry_64.S assembler file and represents the SYSCALL instruction entry point, to the IA32_LSTAR model specific register:

wrmsrl(MSR_LSTAR, entry_SYSCALL_64);

in the arch/x86/kernel/cpu/common.c source code file.
So, the syscall instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose registers. As you can see in the system call table, each system call has a unique number. In our example, the first system call is write, which writes data to the given file. Let's look in the system call table and try to find the write system call. As we can see, the write system call has number 1. We pass the number of this system call through the rax register in our example. The next general purpose registers, %rdi, %rsi and %rdx, take the parameters of the write syscall. In our case, they are the file descriptor (1 is stdout in our case), the pointer to our string, and the size of the data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is just a C function in the kernel space. In our case the first system call is write. This system call is defined in the fs/read_write.c source code file and looks like:
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    ...
    ...
    ...
}

Or in other words:

ssize_t write(int fd, const void *buf, size_t nbytes);

Don't worry about the SYSCALL_DEFINE3 macro for now, we'll come back to it.
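The same idea, a number plus parameters, can also be tried from C without writing assembly. The sketch below assumes Linux with glibc (or another libc that provides the generic syscall function and the SYS_ constants from sys/syscall.h). The demo_raw_write and demo_raw_getpid helpers are names made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke write by number through the generic libc syscall() entry point. */
static long demo_raw_write(int fd, const char *msg)
{
    assert(msg != NULL);
    return syscall(SYS_write, fd, msg, strlen(msg));
}

/* Returns the process ID straight from the kernel. */
static long demo_raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

A call like demo_raw_write(1, "Hello, world!\n") should return the number of bytes written, just as the write in the assembly example does, and a bad file descriptor gives -1 with errno set.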
The second part of our example is the same, but we call another system call. In this case we call the exit system call. This system call takes only one parameter:

Return value

and handles the way our program exits. We can pass the name of our program to the strace util and we will see our system calls:

$ strace ./test
execve("./test", ["./test"], [/* 62 vars */]) = 0
write(1, "Hello, world!\n", 14Hello, world!
)                       = 14
_exit(0)                                = ?
+++ exited with 0 +++
In the first line of the strace output, we can see the execve system call that executes our program, and the second and third are the system calls that we have used in our program: write and exit. Note that we pass the parameters through the general purpose registers in our example. The order of the registers is not accidental. It is defined by the following agreement - the x86-64 calling conventions. This and other agreements for the x86_64 architecture are explained in a special document - System V Application Binary Interface. PDF. In general, the argument(s) of a function are placed either in registers or pushed on the stack. The right order is:

rdi;
rsi;
rdx;
rcx;
r8;
r9.

for the first six parameters of a function. If a function has more than six arguments, the other parameters will be placed on the stack. Note that for the syscall instruction itself the fourth argument is passed in r10 instead of rcx, because syscall saves the return address in rcx, as we saw in the quote from the manual above.
We do not use system calls in our code directly, but our program uses them when we want to print something, check access to a file or just write or read something to it.

For example:

#include <stdio.h>

int main(int argc, char **argv)
{
    FILE *fp;
    char buff[255];

    fp = fopen("test.txt", "r");
    if (!fp)        /* test.txt must exist in the current directory */
        return 1;
    fgets(buff, 255, fp);
    printf("%s\n", buff);
    fclose(fp);

    return 0;
}
There are no fopen, fgets, printf and fclose system calls in the Linux kernel, but open, read, write and close instead. I think you know that these four functions, fopen, fgets, printf and fclose, are just functions that are defined in the C standard library. Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but use wrapper functions from the standard library instead. The main reason for this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility for performing system calls with the correct set of parameters and makes different checks before it calls the given system call. Let's compile our program with the following command:

$ gcc test.c -o test

and look at it with the ltrace util:

$ ltrace ./test
__libc_start_main([ "./test" ] <unfinished ...>
fopen("test.txt", "r")                                  = 0x602010
fgets("Hello World!\n", 255, 0x602010)                  = 0x7ffd2745e700
puts("Hello World!\n"Hello World!
)                                                       = 14
fclose(0x602010)                                        = 0
+++ exited (status 0) +++
The ltrace util displays the set of library calls made by a program. The fopen function opens the given text file, fgets reads the file content into the buff buffer, the puts function prints it to stdout and the fclose function closes the given stream. And as I already wrote, all of these functions call an appropriate system call. For example, puts calls the write system call inside. We can see it if we add the -S option to the ltrace program:

write@SYS(1, "Hello World!\n\n", 14) = 14
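For comparison, the same kind of job can be done with the thin open, read, write and close wrappers instead of the stdio functions. The following is a sketch only: demo_cat_file is a name made up for this example, the whole file is copied rather than one line, and interruption by signals (EINTR) is not handled.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Copy a file to out_fd using only the system call wrappers.
 * Returns the number of bytes copied, or -1 on any error.
 */
static long demo_cat_file(const char *path, int out_fd)
{
    char buf[255];
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;

        /* write() may be partial, so loop until the chunk is out. */
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w <= 0) {
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }

    close(fd);
    return (n < 0) ? -1 : total;
}
```

Compared with the stdio version there is no buffering and every error has to be checked by hand, which is a part of the work that the standard library wrappers take off our hands.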
Yes, system calls are ubiquitous. Each program needs to open/write/read files and network connections, allocate memory and do many other things that can be provided only by the kernel. The proc file system contains special files in the format /proc/pid/syscall that expose the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, which is systemd for me:

$ sudo cat /proc/1/comm
systemd

$ sudo cat /proc/1/syscall
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193

This is the system call with number 232, which is the epoll_wait system call that waits for an I/O event on an epoll file descriptor. Or, for example, the emacs editor where I'm writing this part:

$ ps ax | grep emacs
2093 ?        Sl     2:40 emacs

$ sudo cat /proc/2093/comm
emacs

$ sudo cat /proc/2093/syscall
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c

This is the system call with number 270, which is the sys_pselect6 system call that allows emacs to monitor multiple file descriptors.
Now we know a little about system calls: what they are and why we need them. So let's look at the write system call that our program used.

Implementation of write system call

Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the write system call is defined in the fs/read_write.c source code file and looks like this:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_write(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);
        fdput_pos(f);
    }

    return ret;
}

First of all, the SYSCALL_DEFINE3 macro is defined in the include/linux/syscalls.h header file and expands to the definition of the sys_name(...) function. Let's look at this macro:

#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...) \
        SYSCALL_METADATA(sname, x, __VA_ARGS__) \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

As we can see, the SYSCALL_DEFINE3 macro takes the name parameter, which represents the name of a system call, and a variadic number of parameters. This macro just expands to the SYSCALL_DEFINEx macro that takes the number of parameters of the given system call and the _##name stub for the future name of the system call (you can read more about token concatenation with ## in the documentation of gcc). Next we can see the SYSCALL_DEFINEx macro. This macro expands to the two following macros:

SYSCALL_METADATA;
__SYSCALL_DEFINEx.

The implementation of the first macro, SYSCALL_METADATA, depends on the CONFIG_FTRACE_SYSCALLS kernel configuration option. As we can understand from the name of this option, it allows enabling a tracer to catch the syscall entry and exit events. If this kernel configuration option is enabled, the SYSCALL_METADATA macro initializes the syscall_metadata structure, which is defined in the include/trace/syscall.h header file and contains different useful fields such as the name of a system call, the number of a system call in the system call table, the number of parameters of a system call, the list of parameter types, etc:

#define SYSCALL_METADATA(sname, nb, ...) \
    ... \
    ... \
    ... \
    struct syscall_metadata __used \
      __syscall_meta_##sname = { \
            .name         = "sys"#sname, \
            .syscall_nr   = -1, \
            .nb_args      = nb, \
            .types        = nb ? types_##sname : NULL, \
            .args         = nb ? args_##sname : NULL, \
            .enter_event  = &event_enter_##sname, \
            .exit_event   = &event_exit_##sname, \
            .enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
    }; \
    static struct syscall_metadata __used \
      __attribute__((section("__syscalls_metadata"))) \
     *__p_syscall_meta_##sname = &__syscall_meta_##sname;
If the CONFIG_FTRACE_SYSCALLS kernel option is not enabled during kernel configuration, the SYSCALL_METADATA macro expands to an empty string:

#define SYSCALL_METADATA(sname, nb, ...)

The second macro, __SYSCALL_DEFINEx, expands to the following five declarations and definitions:

#define __SYSCALL_DEFINEx(x, name, ...) \
    asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
        __attribute__((alias(__stringify(SyS##name)))); \
    \
    static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
    \
    asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
    \
    asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
    { \
        long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
        __MAP(x,__SC_TEST,__VA_ARGS__); \
        __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
        return ret; \
    } \
    \
    static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
The first, sys##name, is the definition of the syscall handler function with the given name - sys_system_call_name. The __SC_DECL macro takes the __VA_ARGS__ and combines each parameter type with its parameter name, because the macro definition is unable to determine the parameter types by itself. And the __MAP macro applies the __SC_DECL macro to the __VA_ARGS__ arguments. The other functions that are generated by the __SYSCALL_DEFINEx macro are needed to protect from CVE-2009-0029, and we will not dive into details about this here. Ok, as a result of the SYSCALL_DEFINE3 macro, we will have:

asmlinkage long sys_write(unsigned int fd, const char __user *buf, size_t count);
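The two preprocessor features used here, token pasting with ## and stringification with #, can be tried in a much simpler user-space macro. The DEMO_DEFINE3 and DEMO_DEFINEx macros below are made up for this sketch. They take a fixed list of three type and name pairs instead of using __MAP, and they do not generate any of the extra wrapper functions.

```c
#include <assert.h>
#include <string.h>

/* Paste a prefix onto the name, as SYSCALL_DEFINE3 does with _##name. */
#define DEMO_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
    DEMO_DEFINEx(_##name, t1, a1, t2, a2, t3, a3)

/* Build the "sys" + sname string and pair each type with its parameter name. */
#define DEMO_DEFINEx(sname, t1, a1, t2, a2, t3, a3) \
    static const char demo_name##sname[] = "sys" #sname; \
    long demo_sys##sname(t1 a1, t2 a2, t3 a3)

/* Expands to: long demo_sys_clamp(long value, long lo, long hi) { ... } */
DEMO_DEFINE3(clamp, long, value, long, lo, long, hi)
{
    assert(lo <= hi);
    if (value < lo)
        return lo;
    return value > hi ? hi : value;
}
```

After preprocessing there is an ordinary function called demo_sys_clamp and a string demo_name_clamp that contains "sys_clamp", in the same way as SYSCALL_DEFINE3(write, ...) gives us sys_write and the "sys_write" name in the metadata.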
Now we know a little about the system call's definition and we can go back to the implementation of the write system call. Let's look at the implementation of this system call again:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = file_pos_read(f.file);
        ret = vfs_write(f.file, buf, count, &pos);
        if (ret >= 0)
            file_pos_write(f.file, pos);
        fdput_pos(f);
    }

    return ret;
}
As we already know and can see from the code, it takes three arguments:
fd - file descriptor;
buf - buffer to write;
count - length of buffer to write.
and writes data from a buffer declared by the user to a given device or file. Note that the second parameter, buf, is defined with the __user attribute. The main purpose of this attribute is to allow checking of the Linux kernel code with the sparse util. It is defined in the include/linux/compiler.h header file and depends on the __CHECKER__ definition in the Linux kernel. That's all the useful meta-information related to our sys_write system call; let's try to understand how this system call is implemented. As we can see, it starts with the definition of the f structure, which has the fd structure type that represents a file descriptor in the Linux kernel, and we store in it the result of the call to the fdget_pos function. The fdget_pos function is defined in the include/linux/file.h header file and is just a thin wrapper around the __to_fd function:
static inline struct fd fdget_pos(int fd)
{
    return __to_fd(__fdget_pos(fd));
}
The main purpose of fdget_pos is to convert the given file descriptor, which is just a number, to the fd structure. Through a long chain of function calls, the fdget_pos function gets the file descriptor table of the current process, current->files, and tries to find the corresponding file descriptor number there. Once we have the fd structure for the given file descriptor number, we check it and return -EBADF if the file does not exist. We get the current position in the file with a call to the file_pos_read function, which just returns the f_pos field of our file:
static inline loff_t file_pos_read(struct file *file)
{
    return file->f_pos;
}
and call the vfs_write function. The vfs_write function is defined in the fs/read_write.c source code file and does the work for us: it writes the given buffer to the given file starting from the given position. We will not dive into the details of the vfs_write function, because it is only weakly related to the system call concept and is mostly about the Virtual file system concept, which we will see in another chapter. After vfs_write has finished its work, we check the result and, if it finished successfully, we change the position in the file with the file_pos_write function:
if (ret >= 0)
    file_pos_write(f.file, pos);
which just updates f_pos with the given position in the given file:
static inline void file_pos_write(struct file *file, loff_t pos)
{
    file->f_pos = pos;
}
At the end of our write system call handler, we can see the call of the following function:
fdput_pos(f);
which unlocks the f_pos_lock mutex that protects the file position during concurrent writes from threads that share a file descriptor.
That's all.
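The "read position, write, store position only on success" shape of sys_write can be modelled in a few lines of user-space C. This is only a sketch under loud assumptions: struct demo_file, demo_vfs_write and demo_write are hypothetical names, the "file" is a fixed 64-byte buffer, and there is no locking, so it mirrors the control flow rather than the real VFS.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical miniature of struct file: a fixed buffer plus a position. */
struct demo_file {
    char  data[64];
    off_t f_pos;
};

/* Stand-in for vfs_write(): copy into the buffer at *pos, then advance *pos. */
static ssize_t demo_vfs_write(struct demo_file *file, const char *buf,
                              size_t count, off_t *pos)
{
    if (*pos < 0 || (size_t)*pos + count > sizeof(file->data))
        return -ENOSPC;
    memcpy(file->data + *pos, buf, count);
    *pos += (off_t)count;
    return (ssize_t)count;
}

/* Same shape as sys_write: the position is stored back only on success. */
ssize_t demo_write(struct demo_file *file, const char *buf, size_t count)
{
    ssize_t ret = -EBADF;

    if (file) {
        off_t pos = file->f_pos;          /* like file_pos_read()  */

        ret = demo_vfs_write(file, buf, count, &pos);
        if (ret >= 0)
            file->f_pos = pos;            /* like file_pos_write() */
    }
    return ret;
}
```

Note that a failed write leaves f_pos untouched, because only the local copy pos was handed to the writer.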
We have seen the partial implementation of one system call provided by the Linux kernel. Of course we have missed some parts of the implementation of the write system call because, as I mentioned above, we will see only system call related stuff in this chapter and will not see stuff related to other subsystems, such as the Virtual file system.
Conclusion
This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far, and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
system call
vdso
vsyscall
general purpose registers
socket
C programming language
x86
x86_64
x86-64 calling conventions
System V Application Binary Interface. PDF
GCC
Intel manual. PDF
system call table
GCC macro documentation
file descriptor
stdout
strace
standard library
wrapper functions
ltrace
sparse
proc file system
Virtual file system
systemd
epoll
Previous chapter
System calls in the Linux kernel. Part 2.
How does the Linux kernel handle a system call
The previous part was the first part of the chapter that describes the system call concepts in the Linux kernel. In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the write system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving on to the Linux kernel code.
A user application does not make the system call directly. We did not write the Hello world! program like:
int main(int argc, char **argv)
{
    ...
    ...
    ...
    sys_write(fd1, buf, strlen(buf));
    ...
    ...
}
We can use something similar with the help of the C standard library, and it will look something like this:
#include <unistd.h>
int main(int argc, char **argv)
{
    ...
    ...
    ...
    write(fd1, buf, strlen(buf));
    ...
    ...
}
But anyway, write is not a direct system call and not a kernel function. An application must fill general purpose registers with the correct values in the correct order and use the syscall instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the processor meets the syscall instruction.
Initialization of the system calls table
From the previous part we know that the system call concept is very similar to an interrupt. Historically, system calls were implemented as software interrupts, and the syscall instruction behaves in a similar way: when the processor executes a syscall instruction from a user application, control is transferred to a handler inside the kernel, much as an exception transfers control to an exception handler. As we know, all exception handlers (or in other words kernel C functions that react to an exception) are placed in the kernel code. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the system call table. The system call table is represented by the sys_call_table array in the Linux kernel, which is defined in the arch/x86/entry/syscall_64.c source code file. Let's look at its implementation:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
As we can see, the sys_call_table is an array of __NR_syscall_max + 1 size, where the __NR_syscall_max macro represents the maximum number of system calls for the given architecture. This book is about the x86_64 architecture, so in our case __NR_syscall_max is 322, and this is the correct number at the time of writing (current Linux kernel version is 4.2.0-rc8+). We can see this macro in the header file generated by Kbuild during kernel compilation, include/generated/asm-offsets.h:
#define __NR_syscall_max 322
There will be the same number of system calls in arch/x86/entry/syscalls/syscall_64.tbl for x86_64. There are two important topics here: the type of the sys_call_table array, and the initialization of elements in this array. First of all, the type. The sys_call_ptr_t type represents a pointer to a system call handler, the element type of the system call table. It is defined as a typedef for a function pointer that returns nothing and does not take arguments:
typedef void (*sys_call_ptr_t)(void);
The second thing is the initialization of the sys_call_table array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to sys_ni_syscall. The sys_ni_syscall function represents not-implemented system calls. To start with, all elements of the sys_call_table array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize the storage for the pointers to the system call handlers; it is populated later on. The implementation of sys_ni_syscall is pretty easy, it just returns -errno, or -ENOSYS in our case:
asmlinkage long sys_ni_syscall(void)
{
    return -ENOSYS;
}
The -ENOSYS error tells us that:
ENOSYS Function not implemented (POSIX.1)
Also note the ... in the initialization of the sys_call_table. We can do this with a GCC compiler extension called Designated Initializers. This extension allows us to initialize elements in non-fixed order, and its [first ... last] form gives a whole range of elements the same value. As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by the special script at arch/x86/entry/syscalls/syscalltbl.sh, which builds it from the syscall table. The asm/syscalls_64.h contains invocations of the following macros:
__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...
The __SYSCALL_COMMON macro is defined in the same source code file and expands to the __SYSCALL_64 macro, which expands to a designated initializer for one element of the array:
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
So, after this, our sys_call_table takes the following form:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    [2] = sys_open,
    ...
    ...
    ...
};
After this, all elements that correspond to non-implemented system calls will contain the address of the sys_ni_syscall function that just returns -ENOSYS as we saw above, and the other elements will point to the sys_syscall_name functions.
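The same "fill everything with a default, then override single slots" trick can be tried in user space. The following is a minimal sketch, not kernel code: the demo_ names and the tiny table size are assumptions made for the example, the handlers return a long instead of matching sys_call_ptr_t, and the [first ... last] range designator requires GCC or Clang.

```c
#include <errno.h>

#define DEMO_NR_MAX 7   /* hypothetical highest call number */

typedef long (*demo_call_ptr_t)(void);

static long demo_ni_syscall(void) { return -ENOSYS; }
static long demo_sys_read(void)   { return 100; }
static long demo_sys_write(void)  { return 101; }

/* Every slot starts as "not implemented"; later designators override it. */
static const demo_call_ptr_t demo_call_table[DEMO_NR_MAX + 1] = {
    [0 ... DEMO_NR_MAX] = demo_ni_syscall,   /* GCC/Clang range extension */
    [0] = demo_sys_read,
    [1] = demo_sys_write,
};

/* Look a handler up by number and call it; out-of-range numbers fail. */
long demo_dispatch(unsigned long nr)
{
    if (nr > DEMO_NR_MAX)
        return -ENOSYS;
    return demo_call_table[nr]();
}
```

Slots 2 to 7 were never overridden, so calling them yields -ENOSYS, which is exactly what a user sees for a system call number that the kernel does not implement.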
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a sys_syscall_name function immediately after it is instructed to handle a system call from a user space application. Remember the chapter about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations, like saving user space registers, switching to a new stack and many more tasks, before it calls an interrupt handler. The situation is the same with system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel can start these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
Initialization of the system call entry
When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - 64-ia-32-architectures-software-developer-vol-2b-manual:
SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR
it means that we need to put the system call entry into the IA32_LSTAR model specific register. This operation takes place during the Linux kernel initialization process. If you have read the fourth part of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the trap_init function during the initialization process. This function is defined in the arch/x86/kernel/traps.c source code file and executes the initialization of the non-early exception handlers like divide error, coprocessor error etc. Besides the initialization of the non-early exception handlers, this function calls the cpu_init function from the arch/x86/kernel/cpu/common.c source code file which, besides the initialization of per-cpu state, calls the syscall_init function from the same source code file.
This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all it fills two model specific registers:
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
The first model specific register, MSR_STAR, contains the user code segment selector in bits 63:48. These bits will be loaded into the CS and SS segment registers by the sysret instruction, which provides the functionality to return from a system call to user code with the related privilege. MSR_STAR also contains the kernel code segment selector in bits 47:32, which will be used as the base selector for the CS and SS segment registers when user space applications execute a system call. In the second line of code we fill the MSR_LSTAR register with the entry_SYSCALL_64 symbol that represents the system call entry. The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and contains code related to the preparation performed before a system call handler is executed (I already wrote about these preparations, read above). We will not consider the entry_SYSCALL_64 now, but will return to it later in this chapter.
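The bit layout of the value written to MSR_STAR is easy to check with plain integer arithmetic. The sketch below only packs and unpacks a 64-bit number in user space; it does not touch any MSR. The demo_ function names are hypothetical, and the selector values used in the note after the code (0x23 and 0x10) are assumed example values for __USER32_CS and __KERNEL_CS, not something the text above states.

```c
#include <stdint.h>

/* Pack two selectors the same way the wrmsrl(MSR_STAR, ...) line does. */
uint64_t demo_make_star(uint16_t user32_cs, uint16_t kernel_cs)
{
    return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}

/* Bits 47:32: the selector base used when syscall is executed. */
uint16_t demo_star_syscall_cs(uint64_t star)
{
    return (uint16_t)(star >> 32);
}

/* Bits 63:48: the selector base used when sysret is executed. */
uint16_t demo_star_sysret_cs(uint64_t star)
{
    return (uint16_t)(star >> 48);
}
```

With the assumed selectors, demo_make_star(0x23, 0x10) gives 0x0023001000000000; the low 32 bits stay zero because this code path does not use them.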
After we have set the entry point for system calls, we need to set the following model specific registers:
MSR_CSTAR - target rip for the compatibility mode callers;
MSR_IA32_SYSENTER_CS - target cs for the sysenter instruction;
MSR_IA32_SYSENTER_ESP - target esp for the sysenter instruction;
MSR_IA32_SYSENTER_EIP - target eip for the sysenter instruction.
The values of these model specific registers depend on the CONFIG_IA32_EMULATION kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the CONFIG_IA32_EMULATION kernel configuration option is enabled, we fill MSR_CSTAR with the entry point for system calls in compatibility mode:
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
and then write the kernel code segment to MSR_IA32_SYSENTER_CS, put zero into the stack pointer and write the address of the entry_SYSENTER_compat symbol to the instruction pointer:
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
Otherwise, if the CONFIG_IA32_EMULATION kernel configuration option is disabled, we write the ignore_sysret symbol to MSR_CSTAR:
wrmsrl(MSR_CSTAR, ignore_sysret);
which is defined in the arch/x86/entry/entry_64.S assembly file and just returns the -ENOSYS error code:
ENTRY(ignore_sysret)
    mov $-ENOSYS, %eax
    sysret
END(ignore_sysret)
Now we need to fill the MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP model specific registers, as we did in the previous code when the CONFIG_IA32_EMULATION kernel configuration option was enabled. In this case (when the CONFIG_IA32_EMULATION configuration option is not set) we fill MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP with zero and put the invalid segment of the Global Descriptor Table into the MSR_IA32_SYSENTER_CS model specific register:
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
You can read more about the Global Descriptor Table in the second part of the chapter that describes the booting process of the Linux kernel.
At the end of the syscall_init function, we just mask flags in the flags register by writing the set of flags to the MSR_SYSCALL_MASK model specific register:
wrmsrl(MSR_SYSCALL_MASK,
       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
These flags will be cleared in the flags register every time the syscall instruction is executed. That's all, it is the end of the syscall_init function, and it means that the system call entry is ready to work. Now we can see what will occur when a user application executes the syscall instruction.
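What the mask does on entry can be written as one line of arithmetic: the processor keeps only those flag bits that are not set in the mask. Here is a small user-space sketch of that calculation. The DEMO_ names are hypothetical; the numeric bit values are the architectural positions of the flags in RFLAGS as I know them from the Intel manual, not values taken from the text above.

```c
#include <stdint.h>

/* Architectural RFLAGS bits (values assumed from the Intel manual). */
#define DEMO_EFLAGS_TF   UINT64_C(0x00000100)  /* trap flag          */
#define DEMO_EFLAGS_IF   UINT64_C(0x00000200)  /* interrupt enable   */
#define DEMO_EFLAGS_DF   UINT64_C(0x00000400)  /* direction flag     */
#define DEMO_EFLAGS_IOPL UINT64_C(0x00003000)  /* I/O privilege      */
#define DEMO_EFLAGS_NT   UINT64_C(0x00004000)  /* nested task        */
#define DEMO_EFLAGS_AC   UINT64_C(0x00040000)  /* alignment check    */

/* The same set of flags that syscall_init writes to MSR_SYSCALL_MASK. */
#define DEMO_SYSCALL_MASK \
    (DEMO_EFLAGS_TF | DEMO_EFLAGS_DF | DEMO_EFLAGS_IF | \
     DEMO_EFLAGS_IOPL | DEMO_EFLAGS_AC | DEMO_EFLAGS_NT)

/* Model of the flags the kernel starts with right after syscall. */
uint64_t demo_rflags_after_syscall(uint64_t user_rflags)
{
    return user_rflags & ~DEMO_SYSCALL_MASK;
}
```

In particular the IF bit is always cleared, which is why the entry code starts with interrupts off.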
Preparation before system call handler will be called
As I already wrote, before a system call or an interrupt handler is called by the Linux kernel we need to do some preparations. The idtentry macro performs the preparations required before an exception handler is executed, the interrupt macro performs the preparations required before an interrupt handler is called, and entry_SYSCALL_64 does the preparations required before a system call handler is executed.
The entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and starts from the following macro:
SWAPGS_UNSAFE_STACK
This macro is defined in the arch/x86/include/asm/irqflags.h header file and expands to the swapgs instruction:
#define SWAPGS_UNSAFE_STACK swapgs
which exchanges the current GS base register value with the value contained in the MSR_KERNEL_GS_BASE model specific register. In other words, GS now refers to the kernel's per-cpu area instead of the user space one. After this we save the old stack pointer in the rsp_scratch per-cpu variable and set up the stack pointer to point to the top of the stack for the current processor:
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
In the next step we push the stack segment and the old stack pointer onto the stack:
pushq $__USER_DS
pushq PER_CPU_VAR(rsp_scratch)
After this we enable interrupts, because interrupts are off on entry, and save the general purpose registers (besides bp, bx and r12 to r15), the flags, -ENOSYS for the non-implemented system call and the code segment register on the stack:
ENABLE_INTERRUPTS(CLBR_NONE)
pushq %r11
pushq $__USER_CS
pushq %rcx
pushq %rax
pushq %rdi
pushq %rsi
pushq %rdx
pushq %rcx
pushq $-ENOSYS
pushq %r8
pushq %r9
pushq %r10
pushq %r11
sub $(6*8), %rsp
When a system call occurs from the user's application, general purpose registers have the following state:
rax - contains system call number;
rcx - contains return address to the user space;
r11 - contains register flags;
rdi - contains first argument of a system call handler;
rsi - contains second argument of a system call handler;
rdx - contains third argument of a system call handler;
r10 - contains fourth argument of a system call handler;
r8 - contains fifth argument of a system call handler;
r9 - contains sixth argument of a system call handler;
Other general purpose registers (such as rbp, rbx and r12 to r15) are callee-preserved in the C ABI. So we push the register flags on the top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, the -ENOSYS error code for the non-implemented system call and the other arguments.
In the next step we check the _TIF_WORK_SYSCALL_ENTRY in the current thread_info:
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz tracesys
The _TIF_WORK_SYSCALL_ENTRY macro is defined in the arch/x86/include/asm/thread_info.h header file and provides the set of thread information flags that are related to system call tracing:
#define _TIF_WORK_SYSCALL_ENTRY \
    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
     _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
     _TIF_NOHZ)
We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter devoted to debugging and tracing techniques in the Linux kernel. After the tracesys label, the next label is entry_SYSCALL_64_fastpath. In entry_SYSCALL_64_fastpath we check the __SYSCALL_MASK, which is defined in the arch/x86/include/asm/unistd.h header file as:
#ifdef CONFIG_X86_X32_ABI
#define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
#else
#define __SYSCALL_MASK (~0)
#endif
where the __X32_SYSCALL_BIT is
#define __X32_SYSCALL_BIT 0x40000000
As we can see, the __SYSCALL_MASK depends on the CONFIG_X86_X32_ABI kernel configuration option and represents the mask for the x32 ABI (the ABI with 32-bit pointers) in the 64-bit kernel.
So we check the value of the __SYSCALL_MASK: if CONFIG_X86_X32_ABI is disabled, we compare the value of the rax register to the maximum syscall number (__NR_syscall_max); alternatively, if CONFIG_X86_X32_ABI is enabled, we mask the eax register with __SYSCALL_MASK, which clears the __X32_SYSCALL_BIT, and do the same comparison:
#if __SYSCALL_MASK == ~0
    cmpq $__NR_syscall_max, %rax
#else
    andl $__SYSCALL_MASK, %eax
    cmpl $__NR_syscall_max, %eax
#endif
After this we check the result of the last comparison with the ja instruction, which jumps if both the CF and ZF flags are zero:
ja 1f
and if we have a correct system call number, we move the fourth argument from r10 to rcx to stay compliant with the x86_64 C ABI and execute the call instruction with the address of a system call handler:
movq %r10, %rcx
call *sys_call_table(, %rax, 8)
Note, the sys_call_table is the array that we saw above in this part. As we already know, the rax general purpose register contains the number of a system call and each element of the sys_call_table is 8 bytes. So we use the *sys_call_table(, %rax, 8) notation to find the correct offset in the sys_call_table array for the given system call handler.
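The mask, the range check and the indexed call can be put together in a short user-space model. This is a sketch under stated assumptions and not the kernel's entry code: struct demo_regs, the two toy handlers and the two-element table are hypothetical, and the model behaves as if CONFIG_X86_X32_ABI were enabled so that the masking branch is the one shown.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_X32_SYSCALL_BIT 0x40000000u
#define DEMO_SYSCALL_MASK    (~DEMO_X32_SYSCALL_BIT)
#define DEMO_NR_MAX          1u   /* hypothetical highest call number */

/* Registers as user space leaves them before the syscall instruction. */
struct demo_regs {
    uint64_t rax;                        /* system call number    */
    uint64_t rdi, rsi, rdx, r10, r8, r9; /* arguments one to six  */
};

typedef long (*demo_handler_t)(long, long, long, long, long, long);

static long demo_first_arg(long a, long b, long c, long d, long e, long f)
{
    (void)b; (void)c; (void)d; (void)e; (void)f;
    return a;
}

static long demo_fourth_arg(long a, long b, long c, long d, long e, long f)
{
    (void)a; (void)b; (void)c; (void)e; (void)f;
    return d;
}

static const demo_handler_t demo_table[DEMO_NR_MAX + 1] = {
    demo_first_arg,
    demo_fourth_arg,
};

long demo_entry(const struct demo_regs *regs)
{
    /* andl $__SYSCALL_MASK, %eax */
    uint32_t nr = (uint32_t)regs->rax & DEMO_SYSCALL_MASK;

    /* cmpl $__NR_syscall_max, %eax ; ja 1f */
    if (nr > DEMO_NR_MAX)
        return -ENOSYS;

    /* The fourth C argument comes from r10 (movq %r10, %rcx),
     * then the handler is found by index, like call *table(,%rax,8). */
    return demo_table[nr]((long)regs->rdi, (long)regs->rsi,
                          (long)regs->rdx, (long)regs->r10,
                          (long)regs->r8, (long)regs->r9);
}
```

The reason for the r10 detour is that the syscall instruction itself overwrites rcx with the return address, so user space cannot use rcx for the fourth argument.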
That's all. We did all the required preparations and the system call handler was called for the given system call, for example sys_read, sys_write or another system call handler that is defined with the SYSCALL_DEFINE[N] macro in the Linux kernel code.
Exit from a system call
After a system call handler finishes its work, we return to arch/x86/entry/entry_64.S, right after the place where we called the system call handler:
call *sys_call_table(, %rax, 8)
The next step after we've returned from a system call handler is to put the return value of the system call handler onto the stack. We know that a system call returns the result to the user program in the general purpose rax register, so we move its value onto the stack after the system call handler has finished its work:
movq %rax, RAX(%rsp)
in the RAX slot.
After this we can see the call of the LOCKDEP_SYS_EXIT macro from arch/x86/include/asm/irqflags.h:
LOCKDEP_SYS_EXIT
The implementation of this macro depends on the CONFIG_DEBUG_LOCK_ALLOC kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of the entry_SYSCALL_64 function we restore all general purpose registers besides rcx and r11, because the rcx register must contain the return address to the application that called the system call and the r11 register contains the old flags register. After all general purpose registers are restored, we fill rcx with the return address, the r11 register with the flags and rsp with the old stack pointer:
RESTORE_C_REGS_EXCEPT_RCX_R11
movq RIP(%rsp), %rcx
movq EFLAGS(%rsp), %r11
movq RSP(%rsp), %rsp
USERGS_SYSRET64
In the end we just call the USERGS_SYSRET64 macro, which expands to the swapgs instruction, which exchanges the user GS and kernel GS again, and the sysretq instruction, which performs the exit from a system call handler:
#define USERGS_SYSRET64 \
    swapgs; \
    sysretq;
Now we know what occurs when a user application calls a system call. The full path of this process is as follows:
User application contains code that fills general purpose registers with the values (system call number and arguments of this system call);
Processor switches from user mode to kernel mode and starts execution of the system call entry, entry_SYSCALL_64;
entry_SYSCALL_64 switches to the kernel stack and saves some general purpose registers, the old stack and code segment, flags and etc... on the stack;
entry_SYSCALL_64 checks the system call number in the rax register, searches for a system call handler in the sys_call_table and calls it, if the number of the system call is correct;
If the system call number is not correct, jump to the exit from the system call;
After a system call handler finishes its work, restore general purpose registers, old stack, flags and return address and exit from entry_SYSCALL_64 with the sysretq instruction.
That's all.
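The first step of this path can also be tried from user space without the write wrapper. The sketch below uses the generic syscall() function from the C library, which takes the system call number followed by its arguments and places them in the registers for us. It assumes a Linux system with glibc (or a compatible libc); the demo_raw_write name is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Skip the write() wrapper: hand the system call number and the three
 * arguments to the generic syscall() entry of the C library. */
long demo_raw_write(int fd, const void *buf, unsigned long count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

For example, demo_raw_write(1, "hi\n", 3) writes three bytes to the standard output and, like write, returns the number of bytes written or -1 with errno set.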
Conclusion
This is the end of the second part about the system calls concept in the Linux kernel. In the previous part we saw the theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
system call
write
C standard library
list of cpu architectures
x86_64
kbuild
typedef
errno
gcc
model specific register
intel 2b manual
coprocessor
instruction pointer
flags register
Global Descriptor Table
per-cpu
general purpose registers
ABI
x86_64 C ABI
previous chapter
System calls in the Linux kernel. Part 3.
vsyscalls and vDSO
This is the third part of the chapter that describes system calls in the Linux kernel. In the previous part we saw the preparations made after a system call is issued by a userspace application and the process of handling the system call. In this part we will look at two concepts that are very close to the system call concept; they are called vsyscall and vdso.
We already know what a system call is. This is a special routine in the Linux kernel which a userspace application asks to do privileged tasks, like reading or writing to a file, opening a socket and etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping again into userspace after the system call handler finishes its work. These two mechanisms, vsyscall and vdso, are designed to speed up this process for certain system calls, and in this part we will try to understand how these mechanisms work.
Introduction to vsyscalls
The vsyscall or virtual system call is the first and oldest mechanism in the Linux kernel that is designed to accelerate execution of certain system calls. The principle of work of the vsyscall concept is simple. The Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel documentation for x86_64:
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
or:
~$ sudo cat /proc/1/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
After this, these system calls will be executed in userspace, and this means that there will be no context switching. Mapping of the vsyscall page occurs in the map_vsyscall function that is defined in the arch/x86/entry/vsyscall/vsyscall_64.c source code file. This function is called during the Linux kernel initialization in the setup_arch function that is defined in the arch/x86/kernel/setup.c source code file (we saw this function in the fifth part of the Linux kernel initialization process chapter).
Note that the implementation of the map_vsyscall function depends on the CONFIG_X86_VSYSCALL_EMULATION kernel configuration option:
#ifdef CONFIG_X86_VSYSCALL_EMULATION
extern void map_vsyscall(void);
#else
static inline void map_vsyscall(void) {}
#endif
As we can read in the help text, the CONFIG_X86_VSYSCALL_EMULATION configuration option: Enable vsyscall emulation. Why emulate vsyscall? Actually, the vsyscall is a legacy ABI due to security reasons. Virtual system calls have fixed addresses, meaning that the vsyscall page is at the same location every time, and the location of this page is determined in the map_vsyscall function. Let's look at the implementation of this function:
void __init map_vsyscall(void)
{
    extern char __vsyscall_page;
    unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
    ...
    ...
    ...
}
As we can see, at the beginning of the map_vsyscall function we get the physical address of the vsyscall page with the __pa_symbol macro (we already saw the implementation of this macro in the fourth part of the Linux kernel initialization process). The __vsyscall_page symbol is defined in the arch/x86/entry/vsyscall/vsyscall_emu_64.S assembly source code file and has the following virtual address:
ffffffff81881000 D __vsyscall_page
in the .data..page_aligned, aw section, and contains calls of the following three system calls:
gettimeofday;
time;
getcpu.
Or:
__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret
    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret
    .balign 1024, 0xcc
    mov $__NR_getcpu, %rax
    syscall
    ret
Let's go back to the implementation of the map_vsyscall function and return to the implementation of the __vsyscall_page later. After we have received the physical address of the __vsyscall_page, we check the value of the vsyscall_mode variable and set the fix-mapped address for the vsyscall page with the __set_fixmap macro:
if (vsyscall_mode != NONE)
    __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                 vsyscall_mode == NATIVE
                 ? PAGE_KERNEL_VSYSCALL
                 : PAGE_KERNEL_VVAR);
The __set_fixmap takes three arguments. The first is the index of the fixed_addresses enum. In our case VSYSCALL_PAGE is the first element of the fixed_addresses enum for the x86_64 architecture:
enum fixed_addresses {
    ...
    ...
    ...
#ifdef CONFIG_X86_VSYSCALL_EMULATION
    VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
    ...
    ...
    ...
It is equal to 511. The second argument is the physical address of the page that has to be mapped, and the third argument is the flags of the page. Note that the flags of the VSYSCALL_PAGE depend on the vsyscall_mode variable. They will be PAGE_KERNEL_VSYSCALL if the vsyscall_mode variable is NATIVE and PAGE_KERNEL_VVAR otherwise. Both macros (PAGE_KERNEL_VSYSCALL and PAGE_KERNEL_VVAR) will be expanded to the following flags:
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
that represent access rights to the vsyscall page. Both sets of flags contain the same _PAGE_USER flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The other part depends on the value of the vsyscall_mode variable. The first set of flags (__PAGE_KERNEL_VSYSCALL) will be used in the case where vsyscall_mode is NATIVE. This means virtual system calls will be native syscall instructions. Otherwise the vsyscall page will have PAGE_KERNEL_VVAR, if the vsyscall_mode variable is EMULATE. In this case virtual system calls will be turned into traps and emulated reasonably. The vsyscall_mode variable gets its value in the vsyscall_setup function:
static int __init vsyscall_setup(char *str)
{
    if (str) {
        if (!strcmp("emulate", str))
            vsyscall_mode = EMULATE;
        else if (!strcmp("native", str))
            vsyscall_mode = NATIVE;
        else if (!strcmp("none", str))
            vsyscall_mode = NONE;
        else
            return -EINVAL;
        return 0;
    }
    return -EINVAL;
}
which will be called during early kernel parameters parsing:
early_param("vsyscall", vsyscall_setup);
You can read more about the early_param macro in the sixth part of the chapter that describes the process of the initialization of the Linux kernel.
At the end of the map_vsyscall function we just check, with the BUILD_BUG_ON macro, that the virtual address of the vsyscall page is equal to the value of VSYSCALL_ADDR:
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
             (unsigned long)VSYSCALL_ADDR);
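Both the index 511 and the equality checked by BUILD_BUG_ON can be reproduced with ordinary integer arithmetic. The sketch below is a user-space calculation, not kernel code. It relies on assumptions that are not stated in the text above: that FIXADDR_TOP is VSYSCALL_ADDR plus one page, rounded up to a 2 MB boundary, minus one page, and that converting a fixmap index to a virtual address subtracts the index times the page size from FIXADDR_TOP. All DEMO_ and demo_ names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT    12
#define DEMO_PAGE_SIZE     (UINT64_C(1) << DEMO_PAGE_SHIFT)
#define DEMO_PMD_SIZE      (UINT64_C(1) << 21)   /* 2 MB, assumed */
#define DEMO_VSYSCALL_ADDR UINT64_C(0xffffffffff600000)

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t demo_round_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Assumed definition of the top of the fixmap area. */
uint64_t demo_fixaddr_top(void)
{
    return demo_round_up(DEMO_VSYSCALL_ADDR + DEMO_PAGE_SIZE,
                         DEMO_PMD_SIZE) - DEMO_PAGE_SIZE;
}

/* Mirrors VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT. */
uint64_t demo_vsyscall_page_index(void)
{
    return (demo_fixaddr_top() - DEMO_VSYSCALL_ADDR) >> DEMO_PAGE_SHIFT;
}

/* Assumed index-to-address conversion: count pages down from the top. */
uint64_t demo_fix_to_virt(uint64_t idx)
{
    return demo_fixaddr_top() - (idx << DEMO_PAGE_SHIFT);
}
```

Under these assumptions the top of the area is 0xffffffffff7ff000, the distance to the vsyscall address is 0x1ff000 bytes, which is 511 pages, and walking 511 pages down from the top lands exactly on 0xffffffffff600000.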
That's all. The vsyscall page is set up. The result of all the above is the following: if we pass the vsyscall=native parameter to the kernel command line, virtual system calls will be handled as native syscall instructions in arch/x86/entry/vsyscall/vsyscall_emu_64.S. The glibc knows the addresses of the virtual system call handlers. Note that virtual system call handlers are aligned by 1024 (or 0x400) bytes:
__vsyscall_page:
    mov $__NR_gettimeofday, %rax
    syscall
    ret
    .balign 1024, 0xcc
    mov $__NR_time, %rax
    syscall
    ret
    .balign 1024, 0xcc
    mov $__NR_getcpu, %rax
    syscall
    ret
And the start address of the vsyscall page is ffffffffff600000 every time. So, the glibc knows the addresses of all the virtual system call handlers. You can find the definition of these addresses in the glibc source code:
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
All virtual system call requests land at the corresponding VSYSCALL_ADDR_vsyscall_name address inside the __vsyscall_page, where the code puts the number of the virtual system call into the rax general purpose register and executes the syscall instruction, which is native for x86_64.
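Because the three handlers sit at a fixed base with a fixed 0x400 stride, converting between a handler number and its address is simple arithmetic. The following user-space sketch shows both directions. It is written for this text and is not the kernel's or glibc's code; the demo_ names are hypothetical and the slot numbering (0 for gettimeofday, 1 for time, 2 for getcpu) simply follows the order of the addresses above.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_VSYSCALL_ADDR UINT64_C(0xffffffffff600000)
#define DEMO_SLOT_SIZE     UINT64_C(0x400)   /* handlers are 1024 apart */
#define DEMO_NR_SLOTS      3

/* Handler number -> address inside the vsyscall page. */
uint64_t demo_vsyscall_nr_to_addr(unsigned int nr)
{
    return DEMO_VSYSCALL_ADDR + nr * DEMO_SLOT_SIZE;
}

/* Address -> handler number, or -EINVAL for anything that is not
 * exactly the start of one of the three slots. */
int demo_addr_to_vsyscall_nr(uint64_t addr)
{
    uint64_t off;

    if (addr < DEMO_VSYSCALL_ADDR)
        return -EINVAL;
    off = addr - DEMO_VSYSCALL_ADDR;
    if (off % DEMO_SLOT_SIZE != 0)            /* misaligned address */
        return -EINVAL;
    if (off / DEMO_SLOT_SIZE >= DEMO_NR_SLOTS)
        return -EINVAL;
    return (int)(off / DEMO_SLOT_SIZE);
}
```

A check of this kind is what lets an emulating kernel tell a legitimate call at the start of a slot from a jump into the middle of the page.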
In the second case, if we pass the vsyscall=emulate parameter to the kernel command line, an attempt to execute a virtual system call handler will cause a page fault exception. Remember, the vsyscall page then has __PAGE_KERNEL_VVAR access rights, which forbid execution. The do_page_fault function is the #PF or page fault handler. It tries to understand the reason for the last page fault. One of the reasons can be the situation when a virtual system call is called and the vsyscall mode is emulate. In this case the vsyscall will be handled by the emulate_vsyscall function that is defined in the arch/x86/entry/vsyscall/vsyscall_64.c source code file.
The emulate_vsyscall function gets the number of the virtual system call, checks it and, if the check fails, prints an error and sends a segmentation fault signal:
...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);
if (vsyscall_nr < 0) {
    warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall ...");
    goto sigsegv;
}
...
...
...
sigsegv:
    force_sig(SIGSEGV, current);
    return true;
After it has checked the number of the virtual system call, it does some other checks, such as for access_ok violations, and executes the system call function depending on the number of the virtual system call:
switch (vsyscall_nr) {
    case 0:
        ret = sys_gettimeofday(
            (struct timeval __user *)regs->di,
            (struct timezone __user *)regs->si);
        break;
    ...
    ...
    ...
}
In the end we put the result of sys_gettimeofday or another virtual system call handler into the ax general purpose register, as we did with the normal system calls, restore the instruction pointer register and add 8 bytes to the stack pointer register. This operation emulates the ret instruction.
    regs->ax = ret;
do_ret:
    regs->ip = caller;
    regs->sp += 8;
    return true;
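Why do these three assignments emulate ret? The ret instruction pops the return address from the top of the stack into the instruction pointer, which on x86_64 moves the stack pointer up by 8 bytes. The sketch below models this in user space. It is only an illustration: struct demo_regs is a hypothetical three-field register set, the stack is a plain byte array, and sp is treated as an offset into that array instead of a real address.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-down register set: just what the emulation touches. */
struct demo_regs {
    uint64_t ip;   /* instruction pointer */
    uint64_t sp;   /* stack pointer, here an offset into the stack array */
    uint64_t ax;   /* return value */
};

/* Store the result, then do by hand what a ret instruction would do. */
void demo_emulate_ret(struct demo_regs *regs, const uint8_t *stack,
                      int64_t ret)
{
    uint64_t caller;

    /* The caller's return address sits on the top of the stack. */
    memcpy(&caller, stack + regs->sp, sizeof(caller));

    regs->ax = (uint64_t)ret;  /* result goes back in ax           */
    regs->ip = caller;         /* continue right after the call    */
    regs->sp += 8;             /* the return address is now popped */
}
```

After this, the user program continues as if the call into the vsyscall page had executed and returned normally, although no instruction inside the page was ever run.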
That's all. Now let's look at the modern concept, vDSO.
Introduction to vDSO
As I already wrote above, vsyscall is an obsolete concept and has been replaced by the vDSO or virtual dynamic shared object. The main difference between the vsyscall and vDSO mechanisms is that vDSO maps memory pages into each process in a shared object form, while vsyscall is static in memory and has the same address every time. For the x86_64 architecture it is called linux-vdso.so.1. All userspace applications are linked with this shared library via glibc. For example:
~$ ldd /bin/uname
    linux-vdso.so.1 (0x00007ffe014b7000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
    /lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
Or:
~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0 [vdso]
Here we can see that the uname util was linked with three libraries:
linux-vdso.so.1;
libc.so.6;
ld-linux-x86-64.so.2.
The first provides the vDSO functionality, the second is the C standard library and the third is the program interpreter (you can read more about this in the part that describes linkers). So, the vDSO solves the limitations of the vsyscall. The implementation of the vDSO is similar to vsyscall.
Initialization of the vDSO occurs in the init_vdso function that is defined in the arch/x86/entry/vdso/vma.c source code file. This function starts with the initialization of the vDSO image for 64-bit and, depending on the CONFIG_X86_X32_ABI kernel configuration option, the vDSO image for the x32 ABI:
static int __init init_vdso(void)
{
    init_vdso_image(&vdso_image_64);
#ifdef CONFIG_X86_X32_ABI
    init_vdso_image(&vdso_image_x32);
#endif
Both calls initialize the vdso_image structure. This structure is defined in two generated source code files: arch/x86/entry/vdso/vdso-image-64.c and arch/x86/entry/vdso/vdso-image-x32.c. These source code files are generated by the vdso2c program from different source code files and represent different approaches to calling a system call, like int 0x80, sysenter and etc. The full set of the images depends on the kernel configuration.
For example, for the x86_64 Linux kernel it will contain vdso_image_64:
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
But for the x32 ABI, vdso_image_x32:
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
If our kernel is configured for the x86 architecture, or for x86_64 with compatibility mode, we will have the images that call a system call with the int 0x80 interrupt and with the sysenter instruction; if compatibility mode is enabled, we will additionally have the image that calls a system call with the native syscall instruction:
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_syscall;
#endif
extern const struct vdso_image vdso_image_32_sysenter;
#endif
As we can understand from the name of the vdso_image structure, it represents an image of the vDSO for a certain mode of the system call entry. This structure contains information about the size in bytes of the vDSO area, which is always a multiple of PAGE_SIZE (4096 bytes), a pointer to the text mapping, the start and end address of the alternatives (sets of instructions with better alternatives for a certain type of processor) and etc. For example, vdso_image_64 looks like this:
conststructvdso_imagevdso_image_64={
.data=raw_data,
.size=8192,
.text_mapping={
.name="[vdso]",
.pages=pages,
},
.alt=3145,
.alt_len=26,
.sym_vvar_start=-8192,
.sym_vvar_page=-8192,
.sym_hpet_page=-4096,
};
Wheretheraw_datacontainsrawbinarycodeofthe64-bitvDSOsystemcallswhichare2pagesize:
staticstructpage*pages[2];
or8Kilobytes.
LinuxInside
284vsyscallandvDSO
The init_vdso_image function is defined in the same source code file and just initializes vdso_image.text_mapping.pages. First of all, this function calculates the number of pages and initializes each vdso_image.text_mapping.pages[number_of_page] with the virt_to_page macro, which converts a given address to its page structure:
    void __init init_vdso_image(const struct vdso_image *image)
    {
        int i;
        int npages = (image->size) / PAGE_SIZE;
        for (i = 0; i < npages; i++)
            image->text_mapping.pages[i] =
                virt_to_page(image->data + i*PAGE_SIZE);
        ...
        ...
        ...
    }
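The page arithmetic of this loop can be tried out in userspace. The following is only a small Python sketch, not kernel code: the function name vdso_page_offsets is made up for this illustration, and the page size of 4096 bytes and the image size of 8192 bytes are the values mentioned in the text above.

```python
PAGE_SIZE = 4096  # page size in bytes, as stated in the text above


def vdso_page_offsets(image_size, page_size=PAGE_SIZE):
    """Return the byte offset of every page of an image of `image_size` bytes.

    Mirrors the loop in init_vdso_image: npages = size / PAGE_SIZE and
    page i starts at data + i * PAGE_SIZE. The function name is hypothetical.
    """
    if image_size % page_size != 0:
        raise ValueError("image size must be a multiple of the page size")
    npages = image_size // page_size
    return [i * page_size for i in range(npages)]


# vdso_image_64.size is 8192 in the example above, i.e. two pages.
offsets = vdso_page_offsets(8192)
print(offsets)
```

Each offset corresponds to one entry of the pages array, which is why that array has exactly two elements for an 8192 byte image.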
The init_vdso function is passed to the subsys_initcall macro, which adds the given function to the initcalls list. All functions from this list will be called in the do_initcalls function from the init/main.c source code file:
    subsys_initcall(init_vdso);
Ok, we just saw the initialization of the vDSO and the initialization of the page structures related to the memory pages that contain the vDSO system calls. But where are these pages mapped? Actually they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the arch_setup_additional_pages function from the arch/x86/entry/vdso/vma.c source code file, which checks that the vDSO is enabled for x86_64 and calls the map_vdso function:
    int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
    {
        if (!vdso64_enabled)
            return 0;
        return map_vdso(&vdso_image_64, true);
    }
The map_vdso function is defined in the same source code file and maps pages for the vDSO and for the shared vDSO variables. That's all. The main differences between the vsyscall and vDSO concepts are that vsyscall has a static address of ffffffffff600000 and implements 3 system calls, whereas the vDSO is loaded dynamically and implements four system calls:
__vdso_clock_gettime;
__vdso_getcpu;
__vdso_gettimeofday;
__vdso_time.
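To make the difference in addresses more tangible, here is a small Python sketch that looks up a named mapping in the text of a /proc/<pid>/maps file. It is an illustration only: the helper name find_mapping is hypothetical, the [vdso] line is the one shown earlier in this part, and the [vsyscall] line is an assumed example built from the static address mentioned above.

```python
def find_mapping(maps_text, label):
    """Return (start, end) addresses of the first mapping named `label`,
    or None if there is no such mapping in the given maps text."""
    for line in maps_text.splitlines():
        fields = line.split()
        # The mapping name, when present, is the last field of the line.
        if len(fields) >= 6 and fields[-1] == label:
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


# The [vdso] line is taken from the output shown earlier. The [vsyscall]
# line is an illustrative assumption based on the static address above.
sample = (
    "7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0 [vdso]\n"
    "ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]\n"
)

vdso = find_mapping(sample, "[vdso]")
vsyscall = find_mapping(sample, "[vsyscall]")
print(hex(vdso[0]), vdso[1] - vdso[0])
print(hex(vsyscall[0]))
```

On a Linux system the same function can be fed with the contents of /proc/self/maps; the [vdso] address is expected to differ between processes, while the vsyscall address stays the same.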
That's all.
Conclusion
This is the end of the third part about the system calls concept in the Linux kernel. In the previous part we discussed the preparation made on the Linux kernel side before a system call is handled, and the exit process from a system call handler. In this part we continued to dive into things related to the system call concept and learned two new concepts that are very similar to system calls - the vsyscall and the vDSO.
After all three of these parts, we know almost everything related to system calls: we know what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.
The next part will be the last part in this chapter and we will see what occurs when a user runs a program.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
x86_64 memory map
x86_64 context switching
ABI
virtual address
Segmentation
enum
fix-mapped addresses
glibc
BUILD_BUG_ON
Processor register
Page fault
segmentation fault
instruction pointer
stack pointer
uname
Linkers
Previous part
System calls in the Linux kernel. Part 4.
How does the Linux kernel run a program
This is the fourth part of the chapter that describes system calls in the Linux kernel and, as I wrote in the conclusion of the previous part, it will be the last one in this chapter. In the previous part we stopped at two new concepts:
vsyscall;
vDSO;
that are related and very similar to the system call concept.
As you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.
how do we launch our programs?
There are many different ways to launch an application from a user's perspective. For example, we can run a program from the shell or double-click on the application icon. It does not matter. The Linux kernel handles the application launch regardless of how we launch the application.
In this part we will consider the case when we just launch an application from the shell. As you know, the standard way to launch an application from the shell is the following: we launch a terminal emulator application, write the name of the program and optionally pass arguments to it.
Let's consider what occurs when we launch an application from the shell, what the shell does when we write a program name, what the Linux kernel does, etc. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why we will mostly see Linux kernel internals in this part. We will not consider in detail what the shell does, and we will not consider complex cases such as subshells.
My default shell is bash, so I will consider how the bash shell launches a program. So let's start. The bash shell, like any program written in the C programming language, starts from the main function. If you look at the source code of the bash shell, you will find the main function in the shell.c source code file. This function does many different things before the main thread loop of bash starts to work. For example, this function:
checks and tries to open /dev/tty;
checks whether the shell is running in debug mode;
parses command line arguments;
reads the shell environment;
loads .bashrc, .profile and other configuration files;
and many many more.
After all of these operations we can see the call of the reader_loop function. This function is defined in the eval.c source code file and represents the main thread loop, or in other words it reads and executes commands. Once the reader_loop function has made all checks and read the given program name and arguments, it calls the execute_command function from the execute_cmd.c source code file. The execute_command function, through the chain of function calls:
    execute_command
    --> execute_command_internal
    ----> execute_simple_command
    ------> execute_disk_command
    --------> shell_execve
makes different checks, such as whether we need to start a subshell, whether the command is a bash builtin, etc. As I already wrote above, we will not consider all the details of things that are not related to the Linux kernel. At the end of this process, the shell_execve function calls the execve system call:
    execve(command, args, env);
The execve system call has the following signature:
    int execve(const char *filename, char *const argv [], char *const envp[]);
and executes a program by the given filename, with the given arguments and environment variables. In our case this system call is the first and the only one we are interested in, for example:
    $ strace ls
    execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0
    $ strace echo
    execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0
    $ strace uname
    execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
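The same system call can be issued from a few lines of userspace code. The sketch below uses Python's os.execve wrapper and assumes a POSIX system. Note that the fork step is an addition of this example (it keeps the calling process alive so that it can collect the exit status), and the helper name run_program is made up for the illustration.

```python
import os
import sys


def run_program(path, argv, env):
    """Fork a child, replace it with `path` via execve and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: on success execve never returns, the process image is replaced.
        try:
            os.execve(path, argv, env)
        finally:
            # Reached only if execve failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


# Run the Python interpreter itself so that the example does not depend on
# any particular binary being installed; the child exits with status 7.
status = run_program(sys.executable,
                     [sys.executable, "-c", "raise SystemExit(7)"],
                     dict(os.environ))
print(status)
```

The three arguments of os.execve correspond to the filename, argv and envp arguments of the signature shown above.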
So, a user application (bash in our case) calls the system call and, as we already know, the next step is the Linux kernel.
execve system call
We saw the preparation before a system call is called by a user application and after a system call handler finishes its work in the second part of this chapter. We stopped at the call of the execve system call in the previous paragraph. This system call is defined in the fs/exec.c source code file and, as we already know, it takes three arguments:
    SYSCALL_DEFINE3(execve,
            const char __user *, filename,
            const char __user *const __user *, argv,
            const char __user *const __user *, envp)
    {
        return do_execve(getname(filename), argv, envp);
    }
The implementation of execve is pretty simple here; as we can see, it just returns the result of the do_execve function. The do_execve function is defined in the same source code file and does the following things:
initializes two pointers to userspace data with the given arguments and environment variables;
returns the result of do_execveat_common.
We can see its implementation:
    struct user_arg_ptr argv = { .ptr.native = __argv };
    struct user_arg_ptr envp = { .ptr.native = __envp };
    return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
The do_execveat_common function does the main work - it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application; in our case AT_FDCWD means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags. In our case we passed 0 to do_execveat_common. The flags will be checked in a later step, so we will see them later.
First of all, the do_execveat_common function checks the filename pointer and returns if it holds an error value. After this we check the flags of the current process to make sure that the limit of running processes is not exceeded:
    if (IS_ERR(filename))
        return PTR_ERR(filename);
    if ((current->flags & PF_NPROC_EXCEEDED) &&
        atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
        retval = -EAGAIN;
        goto out_ret;
    }
    current->flags &= ~PF_NPROC_EXCEEDED;
If these two checks are successful, we unset the PF_NPROC_EXCEEDED flag in the flags of the current process to prevent a failure of the execve. You can see that in the next step we call the unshare_files function, which is defined in kernel/fork.c and unshares the files of the current task, and check the result of this function:
    retval = unshare_files(&displaced);
    if (retval)
        goto out_ret;
We need to call this function to eliminate a potential leak of the execve'd binary's file descriptor. In the next step we start the preparation of the bprm, which is represented by the struct linux_binprm structure (defined in the include/linux/binfmts.h header file). The linux_binprm structure is used to hold the arguments that are used when loading binaries. For example, it contains the vma field, which has the vm_area_struct type and represents a single memory area over a contiguous interval in a given address space where our application will be loaded, the mm field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.
First of all we allocate memory for this structure with the kzalloc function and check the result of the allocation:
    bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
    if (!bprm)
        goto out_files;
After this we start to prepare the binprm credentials with the call of the prepare_bprm_creds function:
    retval = prepare_bprm_creds(bprm);
    if (retval)
        goto out_free;
    check_unsafe_exec(bprm);
    current->in_execve = 1;
Initialization of the binprm credentials is, in other words, initialization of the cred structure that is stored inside the linux_binprm structure. The cred structure contains the security context of a task, for example the real uid of the task, the real gid of the task, the uid and gid for virtual file system operations, etc. In the next step, having prepared the bprm credentials, we check that we can now safely execute a program with the call of the check_unsafe_exec function and set the current process to the in_execve state.
After all of these operations we call the do_open_execat function, which checks the flags that we passed to the do_execveat_common function (remember that we have 0 in the flags), searches for and opens the executable file on disk, checks that we are not loading a binary file from a noexec mount point (we need to avoid executing a binary from filesystems that do not contain executable binaries, like proc or sysfs), initializes a file structure and returns a pointer to this structure. Next we can see the call of sched_exec:
    file = do_open_execat(fd, filename, flags);
    retval = PTR_ERR(file);
    if (IS_ERR(file))
        goto out_unmark;
    sched_exec();
The sched_exec function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.
After this we need to check the file descriptor of the given executable binary. We check whether the name of our binary file starts with the / symbol or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, or in other words whether the file descriptor is AT_FDCWD (read above about this).
If one of these checks is successful we set the binary parameter filename:
    bprm->file = file;
    if (fd == AT_FDCWD || filename->name[0] == '/') {
        bprm->filename = filename->name;
    }
Otherwise, if the filename is empty we set the binary parameter filename to /dev/fd/%d, and if it is not empty to /dev/fd/%d/%s, which means that we will execute the file to which the file descriptor refers:
    } else {
        if (filename->name[0] == '\0')
            pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
        else
            pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
                                fd, filename->name);
        if (!pathbuf) {
            retval = -ENOMEM;
            goto out_unmark;
        }
        bprm->filename = pathbuf;
    }
    bprm->interp = bprm->filename;
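The rule for choosing bprm->filename can be restated as a tiny userspace function. This is only a sketch in Python that mirrors the two branches above; the function name bprm_filename is made up, and AT_FDCWD is set to -100, which is its value on Linux.

```python
AT_FDCWD = -100  # value of the AT_FDCWD constant on Linux


def bprm_filename(fd, name):
    """Return the name that would be stored in bprm->filename (sketch only)."""
    if fd == AT_FDCWD or name.startswith("/"):
        # Path relative to the current directory, or an absolute path.
        return name
    if name == "":
        # Execute the file that the descriptor itself refers to.
        return "/dev/fd/%d" % fd
    # Path relative to the directory that the descriptor refers to.
    return "/dev/fd/%d/%s" % (fd, name)


print(bprm_filename(AT_FDCWD, "ls"))
print(bprm_filename(3, ""))
print(bprm_filename(3, "bin/ls"))
```

For the plain execve system call the descriptor is always AT_FDCWD, so only the first branch is taken.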
Note that we set not only bprm->filename but also bprm->interp, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter depending on the binary format of the program. You can read above that we already prepared cred for the linux_binprm. The next step is the initialization of the other fields of the linux_binprm. First of all we call the bprm_mm_init function and pass the bprm to it:
    retval = bprm_mm_init(bprm);
    if (retval)
        goto out_unmark;
The bprm_mm_init function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor, or in other words the bprm_mm_init function initializes the mm_struct structure. This structure is defined in the include/linux/mm_types.h header file and represents the address space of a process. We will not consider the implementation of the bprm_mm_init function because we do not yet know many important things related to the Linux kernel memory manager; we just need to know that this function initializes mm_struct and populates it with a temporary stack vm_area_struct.
After this we calculate the count of the command line arguments which were passed to our executable binary and the count of the environment variables, and set them to bprm->argc and bprm->envc respectively:
    bprm->argc = count(argv, MAX_ARG_STRINGS);
    if ((retval = bprm->argc) < 0)
        goto out;
    bprm->envc = count(envp, MAX_ARG_STRINGS);
    if ((retval = bprm->envc) < 0)
        goto out;
As you can see, we do these operations with the help of the count function, which is defined in the same source code file and calculates the count of strings in the argv array. The MAX_ARG_STRINGS macro is defined in the include/uapi/linux/binfmts.h header file and, as we can understand from the macro's name, it represents the maximum number of strings that can be passed to the execve system call. The value of MAX_ARG_STRINGS:
    #define MAX_ARG_STRINGS 0x7FFFFFFF
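The idea behind count can be sketched in userspace as follows. This is a simplified Python model, not the kernel implementation: a Python list terminated by None stands in for the NULL-terminated array of pointers, and returning -E2BIG when the limit is exceeded is my understanding of the kernel helper's behaviour rather than something shown in this part.

```python
import errno

MAX_ARG_STRINGS = 0x7FFFFFFF  # the limit from include/uapi/linux/binfmts.h


def count(argv, max_strings=MAX_ARG_STRINGS):
    """Count the entries before the terminating None (a stand-in for NULL).

    Returns a negative errno value if more than `max_strings` entries are
    found; this error path is an assumption of the sketch.
    """
    n = 0
    for entry in argv:
        if entry is None:
            break
        if n >= max_strings:
            return -errno.E2BIG
        n += 1
    return n


print(count(["ls", "-l", None]))
```

A negative result is what the `if ((retval = bprm->argc) < 0)` checks above are looking for.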
After we have calculated the number of command line arguments and environment variables, we call the prepare_binprm function. We already called a function with a similar name before this moment. That function is called prepare_bprm_creds and we remember that it initializes the cred structure in the linux_binprm. Now the prepare_binprm function:
    retval = prepare_binprm(bprm);
    if (retval < 0)
        goto out;
fills the linux_binprm structure with the uid from the inode and reads 128 bytes from the binary executable file. We read only the first 128 bytes of the executable file because we need to check the type of our executable. We will read the rest of the executable file in a later step. After the preparation of the linux_binprm structure we copy the filename of the executable binary file, the command line arguments and the environment variables to the linux_binprm with the calls of the copy_strings_kernel and copy_strings functions:
    retval = copy_strings_kernel(1, &bprm->filename, bprm);
    if (retval < 0)
        goto out;
    retval = copy_strings(bprm->envc, envp, bprm);
    if (retval < 0)
        goto out;
    retval = copy_strings(bprm->argc, argv, bprm);
    if (retval < 0)
        goto out;
And set the pointer to the top of the new program's stack that we set in the bprm_mm_init function:
    bprm->exec = bprm->p;
The top of the stack will contain the program filename, and we store the address of this position in the exec field of the linux_binprm structure.
Now that we have filled the linux_binprm structure, we call the exec_binprm function:
    retval = exec_binprm(bprm);
    if (retval < 0)
        goto out;
First of all, in exec_binprm we store the pid of the current task and its pid as seen from the namespace of its parent:
    old_pid = current->pid;
    rcu_read_lock();
    old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
    rcu_read_unlock();
and call the:
    search_binary_handler(bprm);
function. This function goes through the list of handlers for different binary formats. Currently the Linux kernel supports the following binary formats:
binfmt_script - support for interpreted scripts that start with the #! line;
binfmt_misc - support for different binary formats, according to the runtime configuration of the Linux kernel;
binfmt_elf - support for the elf format;
binfmt_aout - support for the a.out format;
binfmt_flat - support for the flat format;
binfmt_elf_fdpic - support for elf FDPIC binaries;
binfmt_em86 - support for Intel elf binaries running on Alpha machines.
So, search_binary_handler tries to call the load_binary function and passes linux_binprm to it. If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:
    int search_binary_handler(struct linux_binprm *bprm)
    {
        ...
        ...
        ...
        list_for_each_entry(fmt, &formats, lh) {
            retval = fmt->load_binary(bprm);
            if (retval < 0 && !bprm->mm) {
                force_sigsegv(SIGSEGV, current);
                return retval;
            }
        }
        return retval;
Where load_binary, for example for elf, checks the magic number (each elf binary file contains a magic number in its header) in the linux_binprm buffer (remember that we read the first 128 bytes from the executable binary file) and exits if it is not an elf binary:
    static int load_elf_binary(struct linux_binprm *bprm)
    {
        ...
        ...
        ...
        loc->elf_ex = *((struct elfhdr *)bprm->buf);
        if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
            goto out;
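The way the handler list is walked can be modelled in userspace. The following Python sketch is a strong simplification: it knows only two formats, the handler functions and the FORMATS list are made up for the illustration, and a handler signals "not my format" by returning -ENOEXEC, which is how I understand the kernel handlers to behave. The ELF magic bytes (0x7f followed by "ELF") and the 128 byte buffer size follow the text above.

```python
import errno

BINPRM_BUF_SIZE = 128      # only the first 128 bytes of the file are examined
ELFMAG = b"\x7fELF"        # magic number at the start of every elf file


def load_script(buf):
    """Accept files that start with the #! line."""
    return 0 if buf[:2] == b"#!" else -errno.ENOEXEC


def load_elf(buf):
    """Accept files whose header starts with the elf magic number."""
    return 0 if buf[:len(ELFMAG)] == ELFMAG else -errno.ENOEXEC


# Hypothetical stand-in for the kernel's list of registered formats.
FORMATS = [("script", load_script), ("elf", load_elf)]


def search_handler(data):
    """Return the name of the first format that accepts the file, or None."""
    head = data[:BINPRM_BUF_SIZE]
    for name, load_binary in FORMATS:
        if load_binary(head) == 0:
            return name
    return None


print(search_handler(b"\x7fELF\x02\x01\x01" + bytes(121)))
print(search_handler(b"#!/bin/sh\necho hello\n"))
```

In the kernel a successful load_binary call does much more than recognise the format, as the rest of this part shows for the elf case.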
If the given executable file is in elf format, load_elf_binary continues to execute. load_elf_binary does many different things to prepare the executable file for execution. For example, it checks the architecture and type of the executable file:
    if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
        goto out;
    if (!elf_check_arch(&loc->elf_ex))
        goto out;
and exits if the architecture is wrong or the file is neither an executable nor a shared object. It tries to load the program header table:
    elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
    if (!elf_phdata)
        goto out;
that describes segments. It reads the program interpreter from disk and loads it into memory; the libraries linked with our executable binary file are then loaded by this interpreter. The program interpreter is specified in the .interp section of the executable file and, as you can read in the part that describes Linkers, it is /lib64/ld-linux-x86-64.so.2 for x86_64. It sets up the stack and maps the elf binary into the correct location in memory. It maps the bss and the brk sections and does many other things to prepare the executable file for execution.
At the end of the execution of load_elf_binary we call the start_thread function and pass three arguments to it:
        start_thread(regs, elf_entry, bprm->p);
        retval = 0;
    out:
        kfree(loc);
    out_ret:
        return retval;
These arguments are:
Set of registers for the new task;
Address of the entry point of the new task;
Address of the top of the stack for the new task.
From the function's name we might conclude that it starts a new thread, but it does not. The start_thread function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:
    void
    start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
    {
        start_thread_common(regs, new_ip, new_sp,
                            __USER_CS, __USER_DS, 0);
    }
As we can see, the start_thread function just calls the start_thread_common function, which will do everything for us:
    static void
    start_thread_common(struct pt_regs *regs, unsigned long new_ip,
                        unsigned long new_sp,
                        unsigned int _cs, unsigned int _ss, unsigned int _ds)
    {
        loadsegment(fs, 0);
        loadsegment(es, _ds);
        loadsegment(ds, _ds);
        load_gs_index(0);
        regs->ip = new_ip;
        regs->sp = new_sp;
        regs->cs = _cs;
        regs->ss = _ss;
        regs->flags = X86_EFLAGS_IF;
        force_iret();
    }
The start_thread_common function fills the fs segment register with zero and es and ds with the given data segment value. After this we set new values for the instruction pointer, the stack pointer, the cs and ss segments and the flags. At the end of the start_thread_common function we can see the force_iret macro, which forces a system call return via the iret instruction. Ok, we have prepared the new thread to run in userspace and now we can return from exec_binprm, and we are in do_execveat_common again. After exec_binprm finishes its execution we release the memory for structures that were allocated before and return.
After we return from the execve system call handler, the execution of our program starts. This is possible because all context related information is already configured for this purpose. As we saw, the execve system call does not return control to the calling program; instead the code, data and other segments of the caller process are simply overwritten by the segments of the new program. The exit from our application will be implemented through the exit system call.
That's all. From this point our program will be executed.
Conclusion
This is the end of the fourth and last part about the system calls concept in the Linux kernel. We saw almost everything related to the system call concept in these four parts. We started from understanding the system call concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the system call concept, vsyscall and vDSO, and finally we saw how the Linux kernel runs a user program.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
System call
shell
bash
entry point
C
environment variables
file descriptor
real uid
virtual file system
procfs
sysfs
inode
pid
namespace
#!
elf
a.out
flat
Alpha
FDPIC
segments
Linkers
Processor register
instruction pointer
Previous part
Timers and time management
This chapter describes timers and time management related concepts in the linux kernel.
Introduction - this part is an introduction to timers in the Linux kernel.
Introduction to the clocksource framework - this part describes the clocksource framework in the Linux kernel.
Timers in the Linux kernel. Part 1.
Introduction
This is yet another post that opens a new chapter in the linux-insides book. The previous part was the last part of the chapter that describes the system call concept and now it is time to start a new chapter. As you can understand from the post's title, this chapter will be devoted to timers and time management in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers, and time management in general, are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks: different timeouts, for example in the TCP implementation, knowing the current time, scheduling asynchronous functions, scheduling the next event interrupt and many many more.
So, we will start to learn the implementation of the different time management related things in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did this in the special chapter which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.
Let's start.
Initialization of non-standard PC hardware clock
After the Linux kernel was decompressed (you can read more about this in the Kernel decompression part) the architecture non-specific code starts to work in the init/main.c source code file. After the initialization of the lock validator, the initialization of cgroups and setting the canary value we can see the call of the setup_arch function.
As you may remember, this function is defined in the arch/x86/kernel/setup.c source code file and prepares/initializes architecture-specific things (for example it reserves a place for the bss section, reserves a place for initrd, parses the kernel command line and many other things). Besides this, we can find some time management related functions there.
The first is:
    x86_init.timers.wallclock_init();
We already saw the x86_init structure in the chapter that describes the initialization of the Linux kernel. This structure contains pointers to the default setup functions for different platforms like Intel MID, Intel CE4100, etc. The x86_init structure is defined in arch/x86/kernel/x86_init.c and, as you can see, it assumes standard PC hardware by default.
As we can see, the x86_init structure has the x86_init_ops type, which provides a set of functions for platform specific setup like reserving standard resources, platform specific memory setup, initialization of interrupt handlers, etc. This structure looks like:
    struct x86_init_ops {
        struct x86_init_resources resources;
        struct x86_init_mpparse mpparse;
        struct x86_init_irqs irqs;
        struct x86_init_oem oem;
        struct x86_init_paging paging;
        struct x86_init_timers timers;
        struct x86_init_iommu iommu;
        struct x86_init_pci pci;
    };
We can note the timers field, which has the x86_init_timers type; as we can understand from its name, this field is related to time management and timers. The x86_init_timers structure contains four fields, which are all pointers to functions that return void:
setup_percpu_clockev - set up the per cpu clock event device for the boot cpu;
tsc_pre_init - platform function called before TSC init;
timer_init - initialize the platform timer;
wallclock_init - initialize the wallclock device.
So, as we already know, in our case wallclock_init performs the initialization of the wallclock device. If we look at the x86_init structure, we will see that wallclock_init points to x86_init_noop:
    struct x86_init_ops x86_init __initdata = {
        ...
        ...
        ...
        .timers = {
            .wallclock_init = x86_init_noop,
        },
        ...
        ...
        ...
    }
Where x86_init_noop is just a function that does nothing:
    void __cpuinit x86_init_noop(void) { }
for the standard PC hardware. Actually, the wallclock_init function is used on the Intel MID platform. The assignment of x86_init.timers.wallclock_init is located in the arch/x86/platform/intel-mid/intel-mid.c source code file, in the x86_intel_mid_early_setup function:
    void __init x86_intel_mid_early_setup(void)
    {
        ...
        ...
        ...
        x86_init.timers.wallclock_init = intel_mid_rtc_init;
        ...
        ...
        ...
    }
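The pattern used here, a table of hooks with a do-nothing default that a platform may override, can be modelled in a few lines. The Python sketch below is only an analogy: the dictionary stands in for the x86_init structure, the setup_arch function here is a toy that calls a single hook, and the calls list exists purely so that we can observe which hook ran.

```python
calls = []  # records which platform hooks were actually executed


def x86_init_noop():
    """Default hook for standard PC hardware: does nothing."""
    return None


def intel_mid_rtc_init():
    """Stand-in for the Intel MID specific wallclock initialization."""
    calls.append("intel_mid_rtc_init")


# Toy model of the x86_init structure with its default hook.
x86_init = {"timers": {"wallclock_init": x86_init_noop}}


def setup_arch():
    """Toy version of setup_arch: only calls the wallclock hook."""
    x86_init["timers"]["wallclock_init"]()


setup_arch()  # standard PC: the noop hook runs, nothing is recorded

# Early platform setup replaces the default hook, as on Intel MID.
x86_init["timers"]["wallclock_init"] = intel_mid_rtc_init
setup_arch()
print(calls)
```

The caller never needs to know which platform it runs on; it always calls the same field of the table.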
The implementation of the intel_mid_rtc_init function is in the arch/x86/platform/intel-mid/intel_mid_vrtc.c source code file and looks pretty simple. First of all, this function parses the Simple Firmware Interface M-Real-Time-Clock table to put such devices into the sfi_mrtc_array array, and then initializes the set_time and get_time functions:
    void __init intel_mid_rtc_init(void)
    {
        unsigned long vrtc_paddr;
        sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);
        vrtc_paddr = sfi_mrtc_array[0].phys_addr;
        if (!sfi_mrtc_num || !vrtc_paddr)
            return;
        vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
                                    vrtc_paddr);
        x86_platform.get_wallclock = vrtc_get_time;
        x86_platform.set_wallclock = vrtc_set_mmss;
    }
That's all; after this a device based on Intel MID will be able to get the time from the hardware clock. As I already wrote, the standard PC x86_64 architecture does not support this function and just does nothing during the call of this function. We just saw the initialization of the real time clock for the Intel MID architecture and now it is time to return to the general x86_64 architecture and look at the time management related things there.
Acquainted with jiffies
If we return to the setup_arch function, which is located, as you remember, in the arch/x86/kernel/setup.c source code file, we will see the next call of a time management related function:
    register_refined_jiffies(CLOCK_TICK_RATE);
Before we look at the implementation of this function, we must know about jiffy. As we can read on wikipedia:
    Jiffy is an informal term for any unspecified short period of time
This definition is very similar to the jiffy in the Linux kernel. There is a global variable called jiffies which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:
    extern unsigned long volatile __jiffy_data jiffies;
during the initialization process. This global variable is incremented on each timer interrupt. Besides this, near the jiffies variable we can see the definition of a similar variable:
    extern u64 jiffies_64;
Actually only one of these variables is in use in the Linux kernel, and which one depends on the processor type. For x86_64 the u64 variable is used and for x86 the unsigned long one. We will see this if we look at the arch/x86/kernel/vmlinux.lds.S linker script:
    #ifdef CONFIG_X86_32
    ...
    jiffies = jiffies_64;
    ...
    #else
    ...
    jiffies_64 = jiffies;
    ...
    #endif
In the case of x86_32 jiffies will be the lower 32 bits of the jiffies_64 variable. Schematically, we can imagine it as follows:
                        jiffies_64
    +-----------------------------------------------------+
    |                       |                             |
    |                       |                             |
    |                       |       jiffies on `x86_32`   |
    |                       |                             |
    |                       |                             |
    +-----------------------------------------------------+
    63                     31                             0
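The relationship in the diagram is a plain bit mask. Here is a minimal Python sketch of it; the function name jiffies_low32 is made up for the illustration.

```python
def jiffies_low32(jiffies_64):
    """Return the lower 32 bits of a 64-bit tick counter (bits 31..0)."""
    return jiffies_64 & 0xFFFFFFFF


# A 64-bit counter that has already gone past the 32-bit range.
value = (1 << 32) + 5
print(jiffies_low32(value))
```

Once the 64-bit counter passes 2^32 ticks, the 32-bit view starts again from zero while jiffies_64 keeps growing.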
Now we know a little theory about jiffies and we can return to our function. There is no architecture-specific implementation for our function, register_refined_jiffies. This function is located in the generic kernel code - the kernel/time/jiffies.c source code file. The main point of register_refined_jiffies is the registration of the jiffy clocksource. Before we look at the implementation of the register_refined_jiffies function, we must know what a clocksource is. As we can read in the comments:
    The `clocksource` is hardware abstraction for a free-running counter.
I'm not sure about you, but that description didn't give me a good understanding of the clocksource concept. Let's try to understand what it is, but we will not go deeper because this topic will be described in a separate part in much more detail. The main point of the clocksource is timekeeping abstraction, or in very simple words - it provides a time value to the kernel. We already know about the jiffies interface, which represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and incremented on each timer interrupt. The Linux kernel can use jiffies for time measurement. So why do we need a separate concept like the clocksource? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.
For example, x86 has an on-chip 64-bit counter called the Time Stamp Counter, whose frequency can be equal to the processor frequency. Or, for example, the High Precision Event Timer, which consists of a 64-bit counter with a frequency of at least 10 MHz. These are two different timers and they are both for x86. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the clocksource concept to solve the problem.
The clocksource concept is represented by the clocksource structure in the Linux kernel. This structure is defined in the include/linux/clocksource.h header file and contains a couple of fields that describe a time counter. For example, it contains the name field, which is the name of a counter, the flags field, which describes different properties of a counter, pointers to the suspend and resume functions, and many more.
Let's look at the clocksource structure for jiffies, which is defined in the kernel/time/jiffies.c source code file:
    static struct clocksource clocksource_jiffies = {
        .name       = "jiffies",
        .rating     = 1,
        .read       = jiffies_read,
        .mask       = 0xffffffff,
        .mult       = NSEC_PER_JIFFY << JIFFIES_SHIFT,
        .shift      = JIFFIES_SHIFT,
        .max_cycles = 10,
    };
We can see the definition of the default name here - jiffies. The next field is rating, which allows the best registered clock source available for the specified hardware to be chosen by the clock source management code. The rating may have the following values:
1-99 - Only available for bootup and testing purposes;
100-199 - Functional for real use, but not desired;
200-299 - A correct and usable clocksource;
300-399 - A reasonably fast and accurate clocksource;
400-499 - The ideal clocksource. A must-use where available.
For example, the rating of the time stamp counter is 300, but the rating of the high precision event timer is 250. The next field is read - a pointer to the function that allows reading the clocksource's cycle value, or in other words it just returns the jiffies variable cast to the cycle_t type:
    static cycle_t jiffies_read(struct clocksource *cs)
    {
        return (cycle_t) jiffies;
    }
which is just a 64-bit unsigned type:
    typedef u64 cycle_t;
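The next field is the mask value, which ensures that subtraction between counter values from non-64-bit counters does not need special overflow logic. In our case the mask is 0xffffffff and it is 32 bits. This means that the masked jiffies value wraps around to zero after 2^32 ticks, which with a timer frequency of 250 ticks per second takes about 198 days:
    >>> 0xffffffff
    4294967295
    # seconds until the wrap at 250 ticks per second
    >>> 0xffffffff // 250
    17179869
    # the same in whole days
    >>> 0xffffffff // 250 // 86400
    198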
Thenexttwofieldsmultandshiftareusedtoconverttheclocksource'speriodtonanosecondspercycle.Whenthekernelcallstheclocksource.readfunction,thisfunctionreturnsvalueinmachinetimeunitsrepresentedwithcycle_tdatatypethatwesawjustnow.Toconvertthisreturnvaluetothenanosecondsweneedinthesetwofields:multandshift.Theclocksourceprovidesclocksource_cyc2nsfunctionthatwilldoitforuswiththefollowingexpression:
((u64)cycles*mult)>>shift;
Aswecanseethemultfieldisequal:
NSEC_PER_JIFFY<<JIFFIES_SHIFT
#defineNSEC_PER_JIFFY((NSEC_PER_SEC+HZ/2)/HZ)
#defineNSEC_PER_SEC1000000000L
bydefault,andtheshiftis
#ifHZ<34
#defineJIFFIES_SHIFT6
#elifHZ<67
#defineJIFFIES_SHIFT7
#else
#defineJIFFIES_SHIFT8
#endif
The jiffies clock source uses the NSEC_PER_JIFFY multiplier to specify the nanoseconds-per-cycle ratio. Note that the values of JIFFIES_SHIFT and NSEC_PER_JIFFY depend on the HZ value. HZ represents the frequency of the system timer. This macro is defined in include/asm-generic/param.h and depends on the CONFIG_HZ kernel configuration option.
The value of HZ differs for each supported architecture, but for x86 it is defined like:
#define HZ CONFIG_HZ
where CONFIG_HZ can be one of the following values: 100, 250, 300 or 1000 (250 is selected in our configuration).
This means that in our case the timer interrupt frequency is 250 Hz: the interrupt occurs 250 times per second, or once every 4 ms.
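The dependency of NSEC_PER_JIFFY and JIFFIES_SHIFT on HZ is easy to check by hand. The sketch below mirrors the two macros as ordinary functions that take the tick rate as a parameter. This is an assumption made for the example, since in the kernel HZ is a compile-time constant; the function names are hypothetical.

```c
#define NSEC_PER_SEC 1000000000L

/* Nanoseconds in one tick, rounded to nearest, mirroring NSEC_PER_JIFFY. */
static long nsec_per_jiffy(long hz)
{
    return (NSEC_PER_SEC + hz / 2) / hz;
}

/* Shift value chosen from the tick rate, mirroring the JIFFIES_SHIFT ladder. */
static int jiffies_shift(long hz)
{
    if (hz < 34)
        return 6;
    if (hz < 67)
        return 7;
    return 8;
}
```

For a tick rate of 250 this gives 4000000 ns per tick (the 4 ms mentioned above) and a shift of 8.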
The last field that we can see in the definition of the clocksource_jiffies structure is max_cycles, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.
Ok, we just saw the definition of the clocksource_jiffies structure, and we also know a little about jiffies and clocksource. Now it is time to get back to the implementation of our function. At the beginning of this part we stopped at the call of the:
register_refined_jiffies(CLOCK_TICK_RATE);
function from the arch/x86/kernel/setup.c source code file.
As I already wrote, the main purpose of the register_refined_jiffies function is to register the refined_jiffies clocksource. We already saw that the clocksource_jiffies structure represents the standard jiffies clock source. Now, if you look in the kernel/time/jiffies.c source code file, you will find yet another clock source definition:
struct clocksource refined_jiffies;
There is one difference between refined_jiffies and clocksource_jiffies: the standard jiffies based clock source is the lowest common denominator clock source which should function on all systems. As we already know, the jiffies global variable is incremented during each timer interrupt. This means that the standard jiffies based clock source has the same resolution as the timer interrupt frequency. From this we can understand that the standard jiffies based clock source may suffer from inaccuracies. The refined_jiffies clock source uses CLOCK_TICK_RATE as the base of the jiffies shift.
Let's look at the implementation of this function. First of all we can see that the refined_jiffies clock source is based on the clocksource_jiffies structure:
int register_refined_jiffies(long cycles_per_second)
{
    u64 nsec_per_tick, shift_hz;
    long cycles_per_tick;
    refined_jiffies = clocksource_jiffies;
    refined_jiffies.name = "refined-jiffies";
    refined_jiffies.rating++;
    ...
    ...
    ...
Here we can see that we update the name of refined_jiffies to refined-jiffies and increment the rating of this structure. As you remember, clocksource_jiffies has a rating of 1, so our refined_jiffies clocksource will have a rating of 2. This means that the clocksource management code will prefer refined_jiffies over the standard jiffies clock source.
In the next step we need to calculate the number of cycles per one tick:
cycles_per_tick = (cycles_per_second + HZ/2)/HZ;
Note that we used the NSEC_PER_SEC macro as the base of the standard jiffies multiplier. Here we are using cycles_per_second, which is the first parameter of the register_refined_jiffies function. We passed the CLOCK_TICK_RATE macro to the register_refined_jiffies function. This macro is defined in the arch/x86/include/asm/timex.h header file and expands to:
#define CLOCK_TICK_RATE PIT_TICK_RATE
where the PIT_TICK_RATE macro expands to the frequency of the Intel 8253:
#define PIT_TICK_RATE 1193182ul
After this we calculate shift_hz for register_refined_jiffies, which will store hz << 8, in other words the frequency of the system timer scaled by 256. We shift cycles_per_second, the frequency of the programmable interval timer, left by 8 in order to get extra accuracy:
shift_hz = (u64)cycles_per_second << 8;
shift_hz += cycles_per_tick/2;
do_div(shift_hz, cycles_per_tick);
In the next step we calculate the number of nanoseconds per one tick by shifting NSEC_PER_SEC left by 8 too, as we did with shift_hz, and do the same calculation as before:
nsec_per_tick = (u64)NSEC_PER_SEC << 8;
nsec_per_tick += (u32)shift_hz/2;
do_div(nsec_per_tick, (u32)shift_hz);
refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
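If the arithmetic above looks opaque, it can be replayed in user space. The sketch below follows the same steps with plain 64-bit division instead of do_div. The tick rate is passed as the parameter hz, an assumption made for the example (in the kernel it is the constant HZ), and the function name is hypothetical.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * User-space replay of the arithmetic in register_refined_jiffies.
 * Returns the refined number of nanoseconds per tick.
 */
static uint32_t refined_nsec_per_tick(long cycles_per_second, long hz)
{
    long cycles_per_tick = (cycles_per_second + hz / 2) / hz;

    /* Effective tick rate scaled by 2^8, rounded to nearest. */
    uint64_t shift_hz = (uint64_t)cycles_per_second << 8;
    shift_hz += cycles_per_tick / 2;
    shift_hz /= cycles_per_tick;

    /* (NSEC_PER_SEC << 8) / shift_hz, rounded to nearest. */
    uint64_t nsec_per_tick = NSEC_PER_SEC << 8;
    nsec_per_tick += (uint32_t)shift_hz / 2;
    nsec_per_tick /= (uint32_t)shift_hz;

    return (uint32_t)nsec_per_tick;
}
```

For the PIT frequency 1193182 and a tick rate of 250 the result is 4000250 ns rather than the nominal 4000000 ns. The PIT cannot divide its input clock into exactly 250 ticks per second, and the refined value accounts for that difference.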
At the end of the register_refined_jiffies function we register the new clock source with the __clocksource_register function, which is defined in the include/linux/clocksource.h header file, and return:
__clocksource_register(&refined_jiffies);
return 0;
The clock source management code provides the API for clock source registration and selection. As we can see, clock sources are registered by calling the __clocksource_register function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system using the clocksource.rating field, which we already saw when we initialized the clocksource structure for jiffies.
Using the jiffies
We just saw the initialization of two jiffies based clock sources in the previous paragraph:
standard jiffies based clock source;
refined jiffies based clock source.
Don't worry if you don't understand the calculations here. They look frightening at first. Soon, step by step, we will learn these things. So, we just saw the initialization of jiffies based clock sources and we also know that the Linux kernel has the global variable jiffies that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use jiffies we can just use the jiffies global variable by its name or call the get_jiffies_64 function. This function is defined in the kernel/time/jiffies.c source code file and just returns the full 64-bit value of jiffies:
u64 get_jiffies_64(void)
{
    unsigned long seq;
    u64 ret;
    do {
        seq = read_seqbegin(&jiffies_lock);
        ret = jiffies_64;
    } while (read_seqretry(&jiffies_lock, seq));
    return ret;
}
EXPORT_SYMBOL(get_jiffies_64);
Note that the get_jiffies_64 function is not implemented in the same way as jiffies_read, for example:
static cycle_t jiffies_read(struct clocksource *cs)
{
    return (cycle_t) jiffies;
}
We can see that the implementation of get_jiffies_64 is more complex. The reading of the jiffies_64 variable is implemented using seqlocks. Actually this is done for machines that cannot atomically read the full 64-bit value.
If we can access the jiffies or the jiffies_64 variable, we can convert it to human time units. To get the number of seconds we can use the following expression:
jiffies / HZ
So, if we know this, we can express any time interval in ticks. For example:
/* Thirty seconds from now */
jiffies + 30*HZ
/* Two minutes from now */
jiffies + 120*HZ
/* Ten milliseconds from now */
jiffies + HZ / 100
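The same conversions can be written as small helpers. This is only a user-space sketch: the function names are hypothetical, the tick rate is passed as a parameter instead of using the kernel's HZ constant, and rounding milliseconds up is a design choice of this example so that a short delay never turns into zero ticks.

```c
/* Seconds to ticks for a given tick rate. */
static unsigned long secs_to_ticks(unsigned long secs, unsigned long hz)
{
    return secs * hz;
}

/* Milliseconds to ticks, rounded up to the next whole tick. */
static unsigned long msecs_to_ticks(unsigned long msecs, unsigned long hz)
{
    return (msecs * hz + 999) / 1000;
}

/* Ticks back to whole seconds. */
static unsigned long ticks_to_secs(unsigned long ticks, unsigned long hz)
{
    return ticks / hz;
}
```

At a tick rate of 250, thirty seconds are 7500 ticks, while ten milliseconds are 2.5 ticks and therefore round up to 3.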
That's all.
Conclusion
This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: jiffies and clocksource. In the next part we will continue to dive into this interesting theme and, as I already wrote in this part, we will get acquainted with and try to understand the internals of these and other time management concepts in the Linux kernel.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
system call; TCP; lock validator; cgroups; bss; initrd; Intel MID; TSC; void; Simple Firmware Interface; x86_64; real time clock; Jiffy; high precision event timer; nanoseconds; Intel 8253; seqlocks; clocksource documentation; Previous chapter
Timers in the Linux kernel. Part 2.
Introduction to the clocksource framework
The previous part was the first part in the current chapter that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
jiffies
clocksource
The first is the global variable that is declared in the include/linux/jiffies.h header file and represents the counter that is incremented during each timer interrupt. So if we can access this global variable and we know the timer interrupt rate, we can convert jiffies to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant called HZ in the Linux kernel. The value of HZ is equal to the value of the CONFIG_HZ kernel configuration option, and if we look in the arch/x86/configs/x86_64_defconfig kernel configuration file, we will see that the:
CONFIG_HZ_1000=y
kernel configuration option is set. This means that the value of CONFIG_HZ will be 1000 by default for the x86_64 architecture. So, if we divide the value of jiffies by the value of HZ:
jiffies / HZ
we will get the number of seconds that have elapsed since the Linux kernel started to work, or in other words the system uptime. Since HZ represents the number of timer interrupts in a second, we can set a value for some time in the future. For example:
/* one minute from now */
unsigned long later = jiffies + 60*HZ;
/* five minutes from now */
unsigned long later = jiffies + 5*60*HZ;
This is a very common practice in the Linux kernel. For example, if you look in the arch/x86/kernel/smpboot.c source code file, you will find the do_boot_cpu function. This function boots all processors besides the bootstrap processor. You can find a piece of code that waits ten seconds for a response from the application processor:
if (!boot_error) {
    timeout = jiffies + 10*HZ;
    while (time_before(jiffies, timeout)) {
        ...
        ...
        ...
        udelay(100);
    }
    ...
    ...
    ...
}
We assign the jiffies + 10*HZ value to the timeout variable here. As I think you already understood, this means a ten second timeout. After this we enter a loop where we use the time_before macro to compare the current jiffies value and our timeout.
Or, for example, if we look in the sound/isa/sscape.c source code file, which represents the driver for the Ensoniq Soundscape Elite sound card, we will see the obp_startup_ack function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:
static int obp_startup_ack(struct soundscape *s, unsigned timeout)
{
    unsigned long end_time = jiffies + msecs_to_jiffies(timeout);
    do {
        ...
        ...
        ...
        x = host_read_unsafe(s->io_base);
        ...
        ...
        ...
        if (x == 0xfe || x == 0xff)
            return 1;
        msleep(10);
    } while (time_before(jiffies, end_time));
    return 0;
}
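Both loops above compare tick counts with time_before instead of a plain less-than. The reason is that the tick counter is an unsigned value that eventually wraps around, so a deadline computed shortly before the wrap can be numerically smaller than the current count. A minimal user-space sketch of such a wrap-safe comparison is shown below. The name tick_before is hypothetical, and the code assumes the usual two's complement conversion from unsigned to signed.

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is before b" test for tick counters.
 * The unsigned difference is reinterpreted as signed, so the result stays
 * correct as long as the two values are less than half the counter range
 * apart. The conversion is implementation-defined in C, but behaves this
 * way on all common platforms.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

With this helper, a current count six ticks before the wrap is still considered to be before a deadline of 10, although 10 is the numerically smaller value.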
So, you can find that the jiffies variable is very widely used in the Linux kernel code. As I already wrote, we met yet another new time management related concept in the previous part: clocksource. In the previous part we just saw a short description of this concept and the API for clock source registration. Let's take a closer look at this concept in this part.
Introduction to clocksource
The clocksource concept represents the generic API for clock sources management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The concept of time is fundamental in the Linux kernel and in other operating system kernels, and timekeeping is one of the necessities for using this concept. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and many more. Where can the Linux kernel get information about time? First of all there is the Real Time Clock, or RTC, which is a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the drivers/rtc directory. Besides this, each architecture can provide a driver for an architecture-dependent real time clock, for example CMOS/RTC in arch/x86/kernel/rtc.c for the x86 architecture. The second is the system timer, a timer that raises interrupts at a periodic rate. For example, for IBM PC compatibles it was the programmable interval timer.
We already know that for timekeeping purposes we can use jiffies in the Linux kernel. The jiffies can be considered as a read-only global variable which is updated with HZ frequency. We know that HZ is a compile-time kernel parameter whose reasonable range is from 100 to 1000 Hz. So, it is guaranteed to have an interface for time measurement with 1-10 milliseconds resolution. Besides standard jiffies, we saw the refined_jiffies clock source in the previous part, which is based on the i8253/i8254 programmable interval timer tick rate of 1193182 hertz. So we can get something about 1 microsecond resolution with refined_jiffies. At this time, nanoseconds are the favorite choice for the time value units of a given clock source.
The availability of more precise techniques for time interval measurement is hardware-dependent. We just learned a little about x86 dependent timer hardware. But each architecture provides its own timer hardware. Earlier each architecture had its own implementation for this purpose. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the clocksource framework.
The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture independent portion of the code, with the architecture-dependent portion reduced to defining and managing low level hardware pieces of clock sources. Providing a large number of facilities to measure time intervals on different architectures with different hardware is very complex. The implementation of each clock related service is strongly associated with an individual hardware device and, as you can understand, it results in similar implementations for different architectures.
Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are the favorite choice for the time value units of a clock source at this time. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to select, access and scale different clock sources.
The clocksource structure
The foundation of the clocksource framework is the clocksource structure, which is defined in the include/linux/clocksource.h header file. We already saw some fields that are provided by the clocksource structure in the previous part. Let's look at the full definition of this structure and try to describe all of its fields:
struct clocksource {
    cycle_t (*read)(struct clocksource *cs);
    cycle_t mask;
    u32 mult;
    u32 shift;
    u64 max_idle_ns;
    u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
    struct arch_clocksource_data archdata;
#endif
    u64 max_cycles;
    const char *name;
    struct list_head list;
    int rating;
    int (*enable)(struct clocksource *cs);
    void (*disable)(struct clocksource *cs);
    unsigned long flags;
    void (*suspend)(struct clocksource *cs);
    void (*resume)(struct clocksource *cs);
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
    struct list_head wd_list;
    cycle_t cs_last;
    cycle_t wd_last;
#endif
    struct module *owner;
} ____cacheline_aligned;
We already saw the first field of the clocksource structure in the previous part: it is a pointer to the read function, which returns the current counter value of the clock source. For example, we use the jiffies_read function to read the jiffies value:
static struct clocksource clocksource_jiffies = {
    ...
    .read = jiffies_read,
    ...
};
where jiffies_read just returns:
static cycle_t jiffies_read(struct clocksource *cs)
{
    return (cycle_t) jiffies;
}
Or the read_tsc function:
static struct clocksource clocksource_tsc = {
    ...
    .read = read_tsc,
    ...
};
for reading the time stamp counter.
The next field is mask, which ensures that subtraction between counter values from non-64-bit counters does not need special overflow logic. After the mask field, we can see two fields: mult and shift. These fields are the basis of the mathematical functions that convert time values specific to each clock source. In other words, these two fields help us to convert the abstract machine time units of a counter to nanoseconds.
After these two fields we can see the 64-bit max_idle_ns field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. We need this field for Linux kernels built with the CONFIG_NO_HZ kernel configuration option enabled. This kernel configuration option enables the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that the dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The max_idle_ns field represents this sleeping limit.
The next field after max_idle_ns is the maxadj field, which is the maximum adjustment value to mult. The main formula by which we convert cycles to nanoseconds:
((u64) cycles * mult) >> shift;
is not 100% accurate. Instead the number is taken as close as possible to a nanosecond, and maxadj helps to correct this and allows the clocksource API to avoid mult values that might overflow when adjusted. The next four fields are pointers to functions:
enable - optional function to enable the clocksource;
disable - optional function to disable the clocksource;
suspend - suspend function for the clocksource;
resume - resume function for the clocksource.
The next field is max_cycles and, as we can understand from its name, this field represents the maximum cycle value before potential overflow. And the last field is owner, which is a reference to the kernel module that owns the clocksource. This is all. We just went through all the standard fields of the clocksource structure. But you may have noted that we missed some fields of the clocksource structure. We can divide the missed fields into two types. Fields of the first type are already known to us. For example, they are the name field that represents the name of a clocksource, the rating field that helps the Linux kernel to select the best clocksource, and so on. The second type is fields which depend on different Linux kernel configuration options. Let's look at these fields.
The first field is archdata. This field has the arch_clocksource_data type and depends on the CONFIG_ARCH_CLOCKSOURCE_DATA kernel configuration option. This field is relevant only for the x86 and IA64 architectures at the moment. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the vDSO clock mode:
struct arch_clocksource_data {
    int vclock_mode;
};
for the x86 architectures, where the vDSO clock mode can be one of the:
#define VCLOCK_NONE 0
#define VCLOCK_TSC 1
#define VCLOCK_HPET 2
#define VCLOCK_PVCLOCK 3
The last three fields, wd_list, cs_last and wd_last, depend on the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. First of all let's try to understand what a watchdog is. In simple words, a watchdog is a timer that is used to detect computer malfunctions and to recover from them. All three of these fields contain watchdog related data that is used by the clocksource framework. If we grep the Linux kernel source code, we will see that only the arch/x86/Kconfig kernel configuration file contains the CONFIG_CLOCKSOURCE_WATCHDOG kernel configuration option. So, why do x86 and x86_64 need a watchdog? You may already know that all x86 processors have a special 64-bit register, the time stamp counter. This register contains the number of cycles since the reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the watchdog timer in this part; before this we must learn more about timers.
That's all. From this moment we know all fields of the clocksource structure. This knowledge will help us to learn the internals of the clocksource framework.
New clock source registration
We saw only one function from the clocksource framework in the previous part. This function was __clocksource_register. It is defined in the include/linux/clocksource.h header file and, as we can understand from its name, the main point of this function is to register a new clocksource. If we look at the implementation of the __clocksource_register function, we will see that it just calls the __clocksource_register_scale function and returns its result:
static inline int __clocksource_register(struct clocksource *cs)
{
    return __clocksource_register_scale(cs, 1, 0);
}
Before we look at the implementation of the __clocksource_register_scale function, note that the clocksource framework provides additional API for registering a new clock source:
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
{
    return __clocksource_register_scale(cs, 1, hz);
}
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
{
    return __clocksource_register_scale(cs, 1000, khz);
}
All of these functions do the same thing: they return the value of the __clocksource_register_scale function, but with a different set of parameters. The __clocksource_register_scale function is defined in the kernel/time/clocksource.c source code file. To understand the difference between these functions, let's look at the parameters of the __clocksource_register_scale function. As we can see, this function takes three parameters:
cs - the clocksource to be installed;
scale - the scale factor of the clock source. In other words, if we multiply the value of this parameter by freq, we get the frequency of the clocksource in hertz;
freq - the clock source frequency divided by scale.
Now let's look at the implementation of the __clocksource_register_scale function:
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
{
    __clocksource_update_freq_scale(cs, scale, freq);
    mutex_lock(&clocksource_mutex);
    clocksource_enqueue(cs);
    clocksource_enqueue_watchdog(cs);
    clocksource_select();
    mutex_unlock(&clocksource_mutex);
    return 0;
}
First of all we can see that the __clocksource_register_scale function starts with a call to the __clocksource_update_freq_scale function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency, and if it was not passed as zero, we need to calculate the mult and shift parameters for the given clock source. Why do we need to check the value of the frequency? Actually it can be zero. If you looked attentively at the implementation of the __clocksource_register function, you may have noticed that we passed the frequency as 0. We do this only for clock sources that have self-defined mult and shift parameters. Look at the previous part and you will see the calculation of mult and shift for jiffies. The __clocksource_update_freq_scale function will do it for us for other clock sources.
So at the start of the __clocksource_update_freq_scale function we check the value of the frequency parameter, and if it is not zero we need to calculate mult and shift for the given clock source. Let's look at the mult and shift calculation:
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
{
    u64 sec;
    if (freq) {
        sec = cs->mask;
        do_div(sec, freq);
        do_div(sec, scale);
        if (!sec)
            sec = 1;
        else if (sec > 600 && cs->mask > UINT_MAX)
            sec = 600;
        clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
                               NSEC_PER_SEC / scale, sec * scale);
    }
    ...
    ...
    ...
}
Here we can see the calculation of the maximum number of seconds which we can run before a clock source counter overflows. First of all we fill the sec variable with the value of the clock source mask. Remember that a clock source's mask marks the bits that are valid for the given clock source, so it is also the largest value the counter can hold. After this, we can see two division operations. First we divide our sec variable by the clock source frequency and then by the scale factor. The freq parameter tells us how many times the counter is incremented in one second, expressed in units of scale. So, we divide the mask value, which represents the maximum value of a counter (for example jiffy), by the frequency and get the maximum number of seconds for the given clock source. The second division accounts for the scale factor of the clock source, which can be 1 hertz or 1 kilohertz (10^3 Hz).
After we have got the maximum number of seconds, we check this value and set it to 1 or 600 depending on the result in the next step. These values are the limits of the maximum sleeping time for a clocksource, in seconds. In the next step we can see the call of clocks_calc_mult_shift. The main point of this function is the calculation of the mult and shift values for a given clock source. At the end of the __clocksource_update_freq_scale function we check that the just calculated mult value of a given clock source will not cause overflow after adjustment, update the max_idle_ns and max_cycles values of the clock source with the maximum nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
        cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
that we can see in the dmesg output:
$ dmesg | grep "clocksource:"
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
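The two calculations described above, the maximum number of seconds and the choice of mult and shift, can be sketched in user space. The code below is an illustration modelled on what the text describes, not a verbatim copy of the kernel functions. The names max_seconds and calc_mult_shift are made up for this example, and plain division is used instead of do_div.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000U

/* Seconds a counter can run before it wraps, clamped as described above. */
static uint64_t max_seconds(uint64_t mask, uint32_t freq, uint32_t scale)
{
    uint64_t sec = mask / freq / scale;

    if (!sec)
        sec = 1;
    else if (sec > 600 && mask > UINT32_MAX)
        sec = 600;
    return sec;
}

/*
 * Pick mult and shift so that (cycles * mult) >> shift converts a counter
 * running at "from" Hz into units of "to" Hz without overflowing 64 bits
 * for intervals of up to maxsec seconds.
 */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                            uint32_t from, uint32_t to, uint32_t maxsec)
{
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* Bits needed for maxsec worth of cycles reduce the room left for mult. */
    tmp = ((uint64_t)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /* Largest shift whose rounded multiplier still fits in sftacc bits. */
    for (sft = 32; sft > 0; sft--) {
        tmp = (uint64_t)to << sft;
        tmp += from / 2;
        tmp /= from;
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}
```

For a hypothetical 1 MHz counter and a one second range, calc_mult_shift(&mult, &shift, 1000000, NSEC_PER_SEC, 1) selects a shift of 22 and a multiplier of 1000 * 2^22, so that one million cycles convert to exactly one billion nanoseconds. A larger shift keeps more precision in mult, which is why the loop starts from the top and stops at the first value that fits.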
After the __clocksource_update_freq_scale function finishes its work, we can return to the __clocksource_register_scale function that will register the new clock source. We can see the call of the following three functions:
mutex_lock(&clocksource_mutex);
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
mutex_unlock(&clocksource_mutex);
Note that before the first one is called, we lock the clocksource_mutex mutex. The point of the clocksource_mutex mutex is to protect the curr_clocksource variable, which represents the currently selected clocksource, and the clocksource_list variable, which represents the list of registered clock sources. Now, let's look at these three functions.
The first function, clocksource_enqueue, and the other two are defined in the same source code file. We go through all already registered clock sources, or in other words through all elements of the clocksource_list, and try to find the best place for the given clocksource:
static void clocksource_enqueue(struct clocksource *cs)
{
    struct list_head *entry = &clocksource_list;
    struct clocksource *tmp;
    list_for_each_entry(tmp, &clocksource_list, list)
        if (tmp->rating >= cs->rating)
            entry = &tmp->list;
    list_add(&cs->list, entry);
}
In the end we just insert the new clocksource into the clocksource_list. The second function, clocksource_enqueue_watchdog, does almost the same as the previous function, but it inserts the new clock source into the wd_list depending on the flags of the clock source and starts a new watchdog timer. As I already wrote, we will not consider watchdog related stuff in this part but will do it in the next parts.
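The effect of the rating-ordered insertion can be reproduced with a plain singly linked list. The sketch below is a simplified user-space stand-in: struct src and src_enqueue are hypothetical names, and the kernel's list_head machinery is replaced by a next pointer. The ordering rule is the one from clocksource_enqueue: a new entry goes after every existing entry whose rating is greater than or equal to its own.

```c
#include <stddef.h>

/* Simplified stand-in for struct clocksource and its list linkage. */
struct src {
    const char *name;
    int rating;
    struct src *next;
};

/* Insert cs so that the list stays sorted by descending rating. */
static void src_enqueue(struct src **head, struct src *cs)
{
    struct src **link = head;

    /* Skip all entries that rate at least as high as the new one. */
    while (*link && (*link)->rating >= cs->rating)
        link = &(*link)->next;

    cs->next = *link;
    *link = cs;
}
```

Whatever the registration order, the entry with the highest rating ends up at the head of the list, which is what makes the later selection step cheap.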
The last function is clocksource_select. As we can understand from the function's name, the main point of this function is to select the best clocksource from the registered clock sources. This function consists only of a call to a helper function:
static void clocksource_select(void)
{
    __clocksource_select(false);
}
Note that the __clocksource_select function takes one parameter (false in our case). This bool parameter shows how to traverse the clocksource_list. In our case we pass false, which means that we will go through all entries of the clocksource_list. We already know that the clocksource with the best rating will be the first in the clocksource_list after the call of the clocksource_enqueue function, so we can easily get it from this list. After we have found a clock source with the best rating, we switch to it:
if (curr_clocksource != best && !timekeeping_notify(best)) {
    pr_info("Switched to clocksource %s\n", best->name);
    curr_clocksource = best;
}
We can see the result of this operation in the dmesg output:
$ dmesg | grep Switched
[    0.199688] clocksource: Switched to clocksource hpet
[    2.452966] clocksource: Switched to clocksource tsc
Note that we can see two clock sources in the dmesg output (hpet and tsc in our case). Yes, actually there can be many different clock sources on particular hardware. So the Linux kernel knows about all registered clock sources and switches to a clock source with a better rating each time a new clock source is registered.
If we look at the bottom of the kernel/time/clocksource.c source code file, we will see that it has a sysfs interface. The main initialization occurs in the init_clocksource_sysfs function, which is called during device initcalls. Let's look at the implementation of the init_clocksource_sysfs function:
static struct bus_type clocksource_subsys = {
    .name = "clocksource",
    .dev_name = "clocksource",
};
static int __init init_clocksource_sysfs(void)
{
    int error = subsys_system_register(&clocksource_subsys, NULL);
    if (!error)
        error = device_register(&device_clocksource);
    if (!error)
        error = device_create_file(
                &device_clocksource,
                &dev_attr_current_clocksource);
    if (!error)
        error = device_create_file(&device_clocksource,
                                   &dev_attr_unbind_clocksource);
    if (!error)
        error = device_create_file(
                &device_clocksource,
                &dev_attr_available_clocksource);
    return error;
}
device_initcall(init_clocksource_sysfs);
First of all we can see that it registers a clocksource subsystem with the call of the subsys_system_register function. In other words, after the call of this function, we will have the following directory:
$ pwd
/sys/devices/system/clocksource
After this step, we can see the registration of the device_clocksource device, which is represented by the following structure:
static struct device device_clocksource = {
    .id = 0,
    .bus = &clocksource_subsys,
};
and the creation of three files:
dev_attr_current_clocksource;
dev_attr_unbind_clocksource;
dev_attr_available_clocksource.
These files provide information about the current clock source in the system, the available clock sources in the system, and an interface which allows us to unbind a clock source.
After the init_clocksource_sysfs function has been executed, we will be able to find some information about the available clock sources in:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
Or, for example, information about the current clock source in the system:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
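The same files can of course be read from a program. The following is a minimal user-space sketch that reads the first line of a text file. The helper name read_first_line is made up for this example, and the sysfs path is simply the one shown above, which may not exist on every system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a text file into buf (newline stripped).
 * Returns 0 on success and -1 if the file cannot be opened or is empty.
 */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Calling read_first_line("/sys/devices/system/clocksource/clocksource0/current_clocksource", buf, sizeof(buf)) would then place the name of the current clock source into buf on a machine that exposes this file.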
In the previous part, we saw the API for the registration of the jiffies clock source, but didn't dive into details about the clocksource framework. In this part we did, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the clocksource framework provides. There are a couple of additional functions, like clocksource_unregister for removing a given clock source from the clocksource_list, and so on. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested in them, you can find them in kernel/time/clocksource.c.
That's all.
Conclusion
This is the end of the second part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: jiffies and clocksource. In this part we saw some examples of jiffies usage and learned more details about the clocksource concept.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
x86; x86_64; uptime; Ensoniq Soundscape Elite; RTC; interrupts; IBM PC; programmable interval timer; Hz; nanoseconds; dmesg; time stamp counter; loadable kernel module; IA64; watchdog; clock rate; mutex; sysfs; previous part
Linux kernel memory management
This chapter describes memory management in the Linux kernel. You will see here a couple of posts which describe different parts of the Linux memory management framework:
Memblock - describes the early memblock allocator.
Fix-Mapped Addresses and ioremap - describes fix-mapped addresses and early ioremap.
Linux kernel memory management Part 1.
Introduction
Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the last preparations before the kernel entry point part we stopped right before the call of the start_kernel function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first init process. You may remember that we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the start_kernel function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the Linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the Linux kernel memory management framework and its API, starting from the memblock.
Memblock
Memblock is one of the methods of managing memory regions during the early bootstrap period, while the usual kernel memory allocators are not up and running yet. Previously it was called Logical Memory Block, but with the patch by Yinghai Lu it was renamed to memblock. The Linux kernel for the x86_64 architecture uses this method. We already met memblock in the Last preparations before the kernel entry point part. Now it is time to get acquainted with it more closely. We will see how it is implemented.
We will start to learn memblock from the data structures. Definitions of all the data structures can be found in the include/linux/memblock.h header file.
The first structure has the same name as this part and it is:
struct memblock {
    bool bottom_up;
    phys_addr_t current_limit;
    struct memblock_type memory;    /* --> array of memblock_region */
    struct memblock_type reserved;  /* --> array of memblock_region */
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
    struct memblock_type physmem;
#endif
};
This structure contains five fields. The first is bottom_up, which allows allocating memory in bottom-up mode when it is true. The next field is current_limit. This field describes the limit size of the memory block. The next three fields describe the type of the memory block. It can be: reserved, memory and physical memory if the CONFIG_HAVE_MEMBLOCK_PHYS_MAP configuration option is enabled. Now we see yet another data structure, memblock_type. Let's look at its definition:
struct memblock_type {
    unsigned long cnt;
    unsigned long max;
    phys_addr_t total_size;
    struct memblock_region *regions;
};
This structure provides information about a memory type. It contains fields which describe the number of memory regions inside the current memory block (cnt), the size of the allocated array of memory regions (max), the total size of all memory regions (total_size) and a pointer to the array of memblock_region structures (regions). memblock_region is a structure which describes a memory region. Its definition is:
struct memblock_region {
    phys_addr_t base;
    phys_addr_t size;
    unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};
memblock_region provides the base address and size of the memory region, and flags which can be:
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
#define MEMBLOCK_ALLOC_ACCESSIBLE 0
#define MEMBLOCK_HOTPLUG 0x1
Also memblock_region provides an integer field, the numa node selector, if the CONFIG_HAVE_MEMBLOCK_NODE_MAP configuration option is enabled.
Schematically we can imagine it as:
+---------------------------+   +---------------------------+
|         memblock          |   |                           |
|  _______________________  |   |                           |
| |        memory         | |   |       Array of the        |
| |      memblock_type    |-|-->|      memblock_region      |
| |_______________________| |   |                           |
|                           |   +---------------------------+
|  _______________________  |   +---------------------------+
| |       reserved        | |   |                           |
| |      memblock_type    |-|-->|       Array of the        |
| |_______________________| |   |      memblock_region      |
|                           |   |                           |
+---------------------------+   +---------------------------+
These three structures: memblock, memblock_type and memblock_region are the main ones in Memblock. Now that we know about them, we can look at the Memblock initialization process.
Memblock initialization
As the whole memblock API is described in the include/linux/memblock.h header file, the implementation of all these functions is in the mm/memblock.c source code file. Let's look at the top of the source code file, where we will see the initialization of the memblock structure:
struct memblock memblock __initdata_memblock = {
    .memory.regions = memblock_memory_init_regions,
    .memory.cnt = 1,
    .memory.max = INIT_MEMBLOCK_REGIONS,
    .reserved.regions = memblock_reserved_init_regions,
    .reserved.cnt = 1,
    .reserved.max = INIT_MEMBLOCK_REGIONS,
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
    .physmem.regions = memblock_physmem_init_regions,
    .physmem.cnt = 1,
    .physmem.max = INIT_PHYSMEM_REGIONS,
#endif
    .bottom_up = false,
    .current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};
Here we can see the initialization of the memblock structure, which has the same name as the structure type - memblock. First of all, note the __initdata_memblock macro. The definition of this macro looks like:
#ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
#define __init_memblock __meminit
#define __initdata_memblock __meminitdata
#else
#define __init_memblock
#define __initdata_memblock
#endif
You can note that it depends on CONFIG_ARCH_DISCARD_MEMBLOCK. If this configuration option is enabled, the memblock code will be put into the .init section and it will be released after the kernel is booted up.
Next we can see the initialization of the memblock_type memory, memblock_type reserved and memblock_type physmem fields of the memblock structure. Here we are interested only in the memblock_type.regions initialization process. Note that every memblock_type field is initialized with an array of memblock_region:
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
static struct memblock_region memblock_physmem_init_regions[INIT_PHYSMEM_REGIONS] __initdata_memblock;
#endif
The memory and reserved arrays hold 128 memory regions each (the physmem array is sized by the separate INIT_PHYSMEM_REGIONS macro). We can see it in the INIT_MEMBLOCK_REGIONS macro definition:
#define INIT_MEMBLOCK_REGIONS 128
Note that all arrays are also defined with the __initdata_memblock macro which we already saw in the memblock structure initialization (read above if you've forgotten).
The last two fields describe that bottom_up allocation is disabled and that the limit of the current Memblock is:
#define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
which is 0xffffffffffffffff when phys_addr_t is 64 bits wide.
At this step the initialization of the memblock structure is finished and we can look at the Memblock API.
Memblock API
Ok, we have finished with the initialization of the memblock structure and now we can look at the Memblock API and its implementation. As I said above, the whole implementation of memblock is in mm/memblock.c. To understand how memblock works and is implemented, let's look at its usage first of all. There are a couple of places in the linux kernel where memblock is used. For example let's take the memblock_x86_fill function from arch/x86/kernel/e820.c. This function goes through the memory map provided by the e820 and adds memory regions reserved by the kernel to the memblock with the memblock_add function. As we met the memblock_add function first, let's start from it.
This function takes the physical base address and the size of the memory region and adds it to the memblock. The memblock_add function does not do anything special in its body, but just calls the:
memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
function. We pass the memory block type - memory, the physical base address and size of the memory region, the maximum number of nodes, which is 1 if CONFIG_NODES_SHIFT is not set in the configuration file or 1 << CONFIG_NODES_SHIFT if it is set, and flags. The memblock_add_range function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, memblock_add_range checks for the existence of memory regions in the memblock structure with the given memblock_type. If there are no memory regions, we just fill a new memory_region with the given values and return (we already saw the implementation of this in the First touch of the linux kernel memory manager framework). If memblock_type is not empty, we start to add the new memory region to the memblock with the given memblock_type.
First of all we get the end of the memory region with:
phys_addr_t end = base + memblock_cap_size(base, &size);
memblock_cap_size adjusts size so that base + size will not overflow. Its implementation is pretty easy:
static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
{
    return *size = min(*size, (phys_addr_t)ULLONG_MAX - base);
}
memblock_cap_size returns the new size, which is the smaller of the given size and ULLONG_MAX - base.
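The capping rule is easy to try outside the kernel. The following is only a userspace sketch, assuming a 64-bit phys_addr_t; cap_size and region_end are names made up for this illustration and are not kernel functions:

```c
#include <stdint.h>

typedef uint64_t phys_addr_t;   /* assumption: 64-bit physical addresses */

/* Clamp *size so that base + *size cannot wrap around; returns the new size. */
static phys_addr_t cap_size(phys_addr_t base, phys_addr_t *size)
{
    phys_addr_t room = UINT64_MAX - base;   /* bytes left above base */

    if (*size > room)
        *size = room;
    return *size;
}

/* End address of a region after capping, computed like in memblock_add_range. */
static phys_addr_t region_end(phys_addr_t base, phys_addr_t size)
{
    return base + cap_size(base, &size);
}
```

For an ordinary region the size is left untouched, while a region that would run past the top of the address space is shortened so that its end lands exactly on the highest representable address.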
Now that we have the end address of the new memory region, memblock_add_range checks for overlap and merge conditions with the already added memory regions. Insertion of the new memory region into the memblock consists of two steps:
Adding the non-overlapping parts of the new memory area as separate regions;
Merging all neighbouring regions.
We go through all the already stored memory regions and check for overlap with the new region:
for (i = 0; i < type->cnt; i++) {
    struct memblock_region *rgn = &type->regions[i];
    phys_addr_t rbase = rgn->base;
    phys_addr_t rend = rbase + rgn->size;
    if (rbase >= end)
        break;
    if (rend <= base)
        continue;
    ...
    ...
    ...
}
If the new memory region does not overlap any region already stored in the memblock, it is inserted as a new region. This is the first step: we check whether the new regions fit into the regions array and call memblock_double_array otherwise:
while (type->cnt + nr_new > type->max)
    if (memblock_double_array(type, obase, size) < 0)
        return -ENOMEM;
insert = true;
goto repeat;
memblock_double_array doubles the size of the given regions array. Then we set insert to true and go to the repeat label. In the second step, starting from the repeat label, we go through the same loop and insert the current memory region into the memory block with the memblock_insert_region function:
if (base < end) {
    nr_new++;
    if (insert)
        memblock_insert_region(type, i, base, end - base,
                               nid, flags);
}
As we set insert to true in the first step, memblock_insert_region will now be called. memblock_insert_region has almost the same implementation that we saw when we inserted a new region into the empty memblock_type (see above). This function gets the memory region at the insertion index:
struct memblock_region *rgn = &type->regions[idx];
and makes room for the new region by moving it and all following regions one slot up with memmove:
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
After this it fills the memblock_region fields of the new memory region (base, size, etc.) and increases the size of the memblock_type. At the end of the execution, memblock_add_range calls memblock_merge_regions, which merges neighboring compatible regions in the second step.
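The memmove-based insertion can be sketched in userspace as follows. This is a simplified model rather than the kernel function: struct region, struct region_list, MAX_REGIONS and insert_region are hypothetical names, the array has a fixed capacity, and there is no array doubling, node id or flags:

```c
#include <string.h>
#include <stdint.h>

#define MAX_REGIONS 8            /* hypothetical small capacity */

struct region { uint64_t base, size; };

struct region_list {
    unsigned long cnt;
    struct region regions[MAX_REGIONS];
};

/* Insert a region at index idx, shifting later entries up by one slot.
 * Returns 0 on success, -1 if the array is full or idx is out of range. */
static int insert_region(struct region_list *l, unsigned long idx,
                         uint64_t base, uint64_t size)
{
    struct region *rgn;

    if (l->cnt >= MAX_REGIONS || idx > l->cnt)
        return -1;
    rgn = &l->regions[idx];
    /* open a gap at idx; moves zero bytes when appending at the end */
    memmove(rgn + 1, rgn, (l->cnt - idx) * sizeof(*rgn));
    rgn->base = base;
    rgn->size = size;
    l->cnt++;
    return 0;
}
```

Because the caller always passes the index found by the scan loop, the array stays sorted by base address, which is what the later merge step relies on.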
In the second case the new memory region can overlap already stored regions. For example we already have region1 in the memblock:
0                    0x1000
+-----------------------+
|                       |
|                       |
|        region1        |
|                       |
|                       |
+-----------------------+
And now we want to add region2 to the memblock with the following base address and size:
0x100                 0x2000
+-----------------------+
|                       |
|                       |
|        region2        |
|                       |
|                       |
+-----------------------+
In this case we set the base address of the new memory region to the end address of the overlapped region with:
base = min(rend, end);
So it will be 0x1000 in our case. And we insert it as we already did in the second step with:
if (base < end) {
    nr_new++;
    if (insert)
        memblock_insert_region(type, i, base, end - base, nid, flags);
}
In this case we insert the overlapping portion (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion, and merge these portions with memblock_merge_regions. As I said above, the memblock_merge_regions function merges neighboring compatible regions. It goes through all memory regions of the given memblock_type, takes two neighboring memory regions - type->regions[i] and type->regions[i + 1] - and checks whether these regions have the same flags, belong to the same node and whether the end address of the first region equals the base address of the second region. If any of these checks fails, the pair is skipped:
while (i < type->cnt - 1) {
    struct memblock_region *this = &type->regions[i];
    struct memblock_region *next = &type->regions[i + 1];
    if (this->base + this->size != next->base ||
        memblock_get_region_node(this) !=
        memblock_get_region_node(next) ||
        this->flags != next->flags) {
        BUG_ON(this->base + this->size > next->base);
        i++;
        continue;
    }
If the two regions are adjacent and compatible, we add the size of the next region to the size of the first region:
this->size += next->size;
As the first memory region now also covers the next memory region, we move every memory region which is after the next region one index back with the memmove function:
memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
And decrease the count of the memory regions which belong to the memblock_type:
type->cnt--;
After this we will get two memory regions merged into one:
0                                             0x2000
+------------------------------------------------+
|                                                |
|                                                |
|                   region1                      |
|                                                |
|                                                |
+------------------------------------------------+
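The merge step can be reproduced in a few lines of userspace C. This is a sketch under simplifying assumptions: struct mem_region and merge_regions are illustrative names, the node id is stored directly in the structure, and the BUG_ON sanity check is left out:

```c
#include <string.h>
#include <stdint.h>

struct mem_region { uint64_t base, size; unsigned long flags; int nid; };

/* Merge adjacent compatible regions in a sorted array; returns the new count. */
static unsigned long merge_regions(struct mem_region *r, unsigned long cnt)
{
    unsigned long i = 0;

    while (cnt > 1 && i < cnt - 1) {
        struct mem_region *this = &r[i];
        struct mem_region *next = &r[i + 1];

        if (this->base + this->size != next->base ||
            this->nid != next->nid || this->flags != next->flags) {
            i++;            /* not mergeable, look at the next pair */
            continue;
        }
        this->size += next->size;
        /* close the gap that next leaves behind */
        memmove(next, next + 1, (cnt - (i + 2)) * sizeof(*next));
        cnt--;
    }
    return cnt;
}
```

With the example above, once the higher portion of region2 has been inserted as [0x1000, 0x2000), the array holds two touching regions and the loop collapses them into a single region from 0 to 0x2000. Note that i is not advanced after a merge, so a chain of several touching regions is folded into one.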
That's all. This is the whole principle of how the memblock_add_range function works.
There is also the memblock_reserve function, which does the same as memblock_add with one difference: it stores the region in memblock_type.reserved instead of memblock_type.memory.
Of course this is not the full API. Memblock provides an API not only for adding memory and reserved memory regions, but also:
memblock_remove - removes a memory region from memblock;
memblock_find_in_range - finds a free area in a given range;
memblock_free - releases a memory region in memblock;
for_each_mem_range - iterates through memblock areas.
and many more....
Getting info about memory regions
Memblock also provides an API for getting information about allocated memory regions in the memblock. It is split in two parts:
get_allocated_memblock_memory_regions_info - getting info about memory regions;
get_allocated_memblock_reserved_regions_info - getting info about reserved regions.
The implementation of these functions is easy. Let's look at get_allocated_memblock_reserved_regions_info for example:
phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
                    phys_addr_t *addr)
{
    if (memblock.reserved.regions == memblock_reserved_init_regions)
        return 0;
    *addr = __pa(memblock.reserved.regions);
    return PAGE_ALIGN(sizeof(struct memblock_region) *
              memblock.reserved.max);
}
First of all this function checks whether memblock.reserved.regions still points to the static initial array memblock_reserved_init_regions. If it does, no array of reserved regions was allocated dynamically and we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return the aligned size of the allocated array. Note that the PAGE_ALIGN macro is used for alignment. Actually it depends on the size of a page:
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
The implementation of the get_allocated_memblock_memory_regions_info function is the same. It has only one difference: memblock_type.memory is used instead of memblock_type.reserved.
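To make the role of PAGE_ALIGN concrete, here is a small userspace sketch. It assumes 4 KiB pages and a simplified region layout; ALIGN_UP, struct region and regions_array_bytes are names invented for this illustration, not kernel definitions:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL                 /* assumption: 4 KiB pages */
/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))
#define PAGE_ALIGN(x)  ALIGN_UP((x), PAGE_SIZE)

/* Simplified region layout; the real structure may also carry a node id. */
struct region { uint64_t base, size; unsigned long flags; };

/* Bytes occupied by an array of max regions, rounded up to whole pages. */
static unsigned long regions_array_bytes(unsigned long max)
{
    return PAGE_ALIGN(sizeof(struct region) * max);
}
```

The returned value is always a whole number of pages, which is convenient for a caller that wants to release the array page by page.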
Memblock debugging
There are many calls to memblock_dbg in the memblock implementation. If you pass the memblock=debug option to the kernel command line, these calls will print their messages. Actually memblock_dbg is just a macro which expands to printk:
#define memblock_dbg(fmt, ...) \
    if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
For example you can see a call of this macro in the memblock_reserve function:
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
         (unsigned long long)base,
         (unsigned long long)base + size - 1,
         flags, (void *)_RET_IP_);
With memblock=debug on the command line, every such call prints a line in this format to the kernel log.
Memblock also has support in debugfs. If you run the kernel on an architecture other than X86 you can access:
/sys/kernel/debug/memblock/memory
/sys/kernel/debug/memblock/reserved
/sys/kernel/debug/memblock/physmem
to get a dump of the memblock contents.
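The pattern behind memblock_dbg, a message that is printed only when a runtime flag is set, can be tried in userspace. The sketch below is an analogy and not the kernel macro: it uses printf instead of printk, relies on the GNU ##__VA_ARGS__ extension (supported by gcc and clang), and lines_logged and demo_reserve are hypothetical helpers that exist only for this demo:

```c
#include <stdio.h>

static int memblock_debug;      /* models the flag set by memblock=debug */
static int lines_logged;        /* hypothetical counter, only for this demo */

/* Print only when debugging is enabled. */
#define memblock_dbg(fmt, ...)                  \
    do {                                        \
        if (memblock_debug) {                   \
            printf(fmt, ##__VA_ARGS__);         \
            lines_logged++;                     \
        }                                       \
    } while (0)

static void demo_reserve(unsigned long long base, unsigned long long size)
{
    memblock_dbg("memblock_reserve: [%#018llx-%#018llx]\n",
                 base, base + size - 1);
}
```

The do { ... } while (0) wrapper is a choice of this sketch: it keeps the macro safe inside an if/else without braces, which a bare if statement in a macro is not.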
Conclusion
This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-internals.
Links
e820
numa
debugfs
First touch of the linux kernel memory manager framework
Linux kernel memory management Part 2.
Fix-Mapped Addresses and ioremap
Fix-Mapped addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus __START_KERNEL_map. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: to have a constant address at compile time, but to set the physical address only in the boot process. You can remember that in the earliest part, we already set the level2_fixmap_pgt:
NEXT_PAGE(level2_fixmap_pgt)
    .fill   506,8,0
    .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
    .fill   5,8,0
NEXT_PAGE(level1_fixmap_pgt)
    .fill   512,8,0
As you can see level2_fixmap_pgt is right after the level2_kernel_pgt which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the fixed_addresses enum from arch/x86/include/asm/fixmap.h. For example it contains entries for VSYSCALL_PAGE - if emulation of the legacy vsyscall page is enabled, FIX_APIC_BASE for the local apic, etc. In virtual memory the fix-mapped area is placed in the modules area:
       +-----------+-----------------+---------------+------------------+
       |           |                 |               |                  |
       |kernel text|      kernel     |               |    vsyscalls     |
       | mapping   |       text      |    Modules    |    fix-mapped    |
       |from phys 0|       data      |               |    addresses     |
       |           |                 |               |                  |
       +-----------+-----------------+---------------+------------------+
__START_KERNEL_map   __START_KERNEL    MODULES_VADDR            0xffffffffffffffff
The base virtual address and the size of the fix-mapped area are presented by the two following macros:
#define FIXADDR_SIZE (__end_of_permanent_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
Here __end_of_permanent_fixed_addresses is an element of the fixed_addresses enum and as I wrote above: Every fix-mapped address is represented by an integer index which is defined in the fixed_addresses. PAGE_SHIFT determines the size of a page. For example the size of one page can be obtained with 1 << PAGE_SHIFT. In our case we need to get the size of the whole fix-mapped area, not only of one page, that's why we are using __end_of_permanent_fixed_addresses to get the size of the fix-mapped area. In my case it's a little more than 536 kilobytes. In your case it might be a different number, because the size depends on the number of fix-mapped addresses, which depends on your kernel's configuration.
The second macro, FIXADDR_START, just subtracts the size of the fix-mapped area from its last address to get the base virtual address of the fix-mapped area. FIXADDR_TOP is a rounded up address from the base address of the vsyscall space:
#define FIXADDR_TOP (round_up(VSYSCALL_ADDR + PAGE_SIZE, 1 << PMD_SHIFT) - PAGE_SIZE)
The fixed_addresses enums are used as an index to get the virtual address using the fix_to_virt function. The implementation of this function is easy:
static __always_inline unsigned long fix_to_virt(const unsigned int idx)
{
    BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
    return __fix_to_virt(idx);
}
First of all it checks with the BUILD_BUG_ON macro that the index given for the fixed_addresses enum is not greater than or equal to __end_of_fixed_addresses, and then returns the result of the __fix_to_virt macro:
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
Here we shift the given fix-mapped address index left by PAGE_SHIFT, which determines the size of a page as I wrote above, and subtract it from FIXADDR_TOP, which is the highest address of the fix-mapped area. There is an inverse function for getting the fix-mapped index from a virtual address:
static inline unsigned long virt_to_fix(const unsigned long vaddr)
{
    BUG_ON(vaddr >= FIXADDR_TOP || vaddr < FIXADDR_START);
    return __virt_to_fix(vaddr);
}
virt_to_fix takes a virtual address, checks that this address is between FIXADDR_START and FIXADDR_TOP and calls the __virt_to_fix macro, which is implemented as:
#define __virt_to_fix(x) ((FIXADDR_TOP - ((x) & PAGE_MASK)) >> PAGE_SHIFT)
A PFN is simply an index within physical memory that is counted in page-sized units. The PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT);
__virt_to_fix clears the first 12 bits in the given address, subtracts the result from the last address of the fix-mapped area (FIXADDR_TOP) and shifts the result right by PAGE_SHIFT, which is 12. Let me explain how it works. As I already wrote, we clear the first 12 bits in the given address with x & PAGE_MASK, because the first 12 bits of a virtual address represent the offset in the page frame. Subtracting this page-aligned address from FIXADDR_TOP gives the distance in bytes between the top of the fix-mapped area and the page which contains the address. Shifting this distance right by PAGE_SHIFT converts it into a number of pages, and this number is exactly the fix-mapped index.
Fix-mapped addresses are used in different places in the linux kernel. The IDT descriptor is stored there, the Intel Trusted Execution Technology UUID is stored in the fix-mapped area starting from the FIX_TBOOT_BASE index, the Xen bootmap and many more... We already saw a little about fix-mapped addresses in the fifth part about linux kernel initialization. We used the fix-mapped area in the early ioremap initialization. Let's look at it and try to understand what ioremap is, how it is implemented in the kernel and how it is related to the fix-mapped addresses.
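The index and address conversions of the fix-mapped area can be checked with plain integer arithmetic in userspace. The sketch below assumes x86_64 with 4 KiB pages and the legacy vsyscall address 0xffffffffff600000; ROUND_UP is a local helper standing in for the kernel's round_up, and the BUILD_BUG_ON/BUG_ON range checks are omitted:

```c
#include <stdint.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_MASK     (~(PAGE_SIZE - 1))
#define PMD_SHIFT     21
/* Assumption: legacy vsyscall page address on x86_64. */
#define VSYSCALL_ADDR UINT64_C(0xffffffffff600000)

/* Round x up to a multiple of a; a must be a power of two. */
#define ROUND_UP(x, a) ((((x) - 1) | ((a) - 1)) + 1)
#define FIXADDR_TOP \
    (ROUND_UP(VSYSCALL_ADDR + PAGE_SIZE, UINT64_C(1) << PMD_SHIFT) - PAGE_SIZE)

/* Index -> virtual address: count pages downwards from FIXADDR_TOP. */
static uint64_t fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((uint64_t)idx << PAGE_SHIFT);
}

/* Virtual address -> index: drop the page offset, measure distance from top. */
static uint64_t virt_to_fix(uint64_t vaddr)
{
    return (FIXADDR_TOP - (vaddr & PAGE_MASK)) >> PAGE_SHIFT;
}
```

Under these assumptions FIXADDR_TOP evaluates to 0xffffffffff7ff000, index 0 maps to that address, and every following index lies one page lower. Any address inside a page converts back to the index of that page, because the page offset is masked away first.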
ioremap
The Linux kernel provides many different primitives to manage memory. For now we will touch on I/O memory. Every device is controlled by reading from and writing to its registers. For example a driver can turn a device off or on by writing to its registers, or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know for this moment there are two ways to access a device's registers and data buffers:
through the I/O ports;
mapping of all the registers to the memory address space;
In the first case every control register of a device has an input/output port number. A device driver can read from a port and write to it with the two in and out instructions which we already saw. If you want to know about currently registered port regions, you can learn about them by accessing /proc/ioports:
$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vesafb
  03f8-03ff : serial
  04d0-04d1 : pnp 00:06
  0800-087f : pnp 00:01
  0a00-0a0f : pnp 00:04
  0a20-0a2f : pnp 00:04
  0a30-0a3f : pnp 00:04
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
...
...
...
/proc/ioports provides information about which driver uses which I/O port region. All of these regions, for example 0000-0cf7, were claimed with the request_region function from include/linux/ioport.h. Actually request_region is a macro which is defined as:
#define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), 0)
As we can see it takes three parameters:
start - begin of region;
n - length of region;
name - name of requester.
request_region allocates an I/O port region. Very often the check_region function is called before request_region to check that the given address range is available, and release_region is called to release the region. request_region returns a pointer to the resource structure. The resource structure presents an abstraction for a tree-like subset of system resources. We already saw the resource structure in the fifth part about the kernel initialization process and it looks as:
struct resource {
    resource_size_t start;
    resource_size_t end;
    const char *name;
    unsigned long flags;
    struct resource *parent, *sibling, *child;
};
and contains the start and end addresses of the resource, its name, etc. Every resource structure contains pointers to the parent, sibling and child resources. As it has a parent and children, every subset of resources has a root resource structure. For example, for I/O ports it is the ioport_resource structure:
struct resource ioport_resource = {
    .name   = "PCI IO",
    .start  = 0,
    .end    = IO_SPACE_LIMIT,
    .flags  = IORESOURCE_IO,
};
EXPORT_SYMBOL(ioport_resource);
Or for iomem, it is the iomem_resource structure:
struct resource iomem_resource = {
    .name   = "PCI mem",
    .start  = 0,
    .end    = -1,
    .flags  = IORESOURCE_MEM,
};
As I wrote above, request_region is used for registering an I/O port region and this macro is used in many places in the kernel. For example let's look at drivers/char/rtc.c. This source code file provides the Real Time Clock interface in the linux kernel. As every kernel module, the rtc module contains a module_init definition:
module_init(rtc_init);
where rtc_init is the rtc initialization function. This function is defined in the same rtc.c source code file. In the rtc_init function we can see a couple of calls of the rtc_request_region function, which wraps request_region, for example:
r = rtc_request_region(RTC_IO_EXTENT);
where rtc_request_region calls:
r = request_region(RTC_PORT(0), size, "rtc");
Here RTC_IO_EXTENT is the size of the region and it is 0x8, "rtc" is the name of the region and RTC_PORT is:
#define RTC_PORT(x) (0x70 + (x))
So with request_region(RTC_PORT(0), size, "rtc") we register an I/O port region which starts at 0x70 and has size 0x8. Let's look at /proc/ioports:
~$ sudo cat /proc/ioports | grep rtc
0070-0077 : rtc0
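To see why a conflicting request fails, it helps to model the resource tree in userspace. The code below is a deliberately reduced model and not the kernel's __request_region: struct res and request are hypothetical names, only one level of children is handled, there is no locking, and the caller provides the storage for the new node:

```c
#include <stddef.h>

struct res {
    unsigned long start, end;       /* inclusive range, as in struct resource */
    const char *name;
    struct res *parent, *sibling, *child;
};

/* Try to claim [start, start + n - 1] under root; children are kept sorted.
 * Returns new on success or NULL if the range leaves root or conflicts. */
static struct res *request(struct res *root, struct res *new,
                           unsigned long start, unsigned long n,
                           const char *name)
{
    unsigned long end = start + n - 1;
    struct res **p = &root->child;

    if (n == 0 || end < start || start < root->start || end > root->end)
        return NULL;
    /* skip children that lie completely below the requested range */
    while (*p && (*p)->end < start)
        p = &(*p)->sibling;
    /* the next child, if any, must begin after the requested range */
    if (*p && (*p)->start <= end)
        return NULL;
    new->start = start;
    new->end = end;
    new->name = name;
    new->parent = root;
    new->child = NULL;
    new->sibling = *p;
    *p = new;
    return new;
}
```

In this model, claiming 0x70 with length 0x8 under a root that spans 0-0xffff yields the range 0070-0077 seen above, and a second request that touches any port in that range is rejected.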
So, we got it! Ok, that was ports. The second way is the use of I/O memory. As I wrote above, this way maps the control registers and memory of a device into the memory address space. I/O memory is a set of contiguous addresses which are provided by a device to the CPU through a bus. Memory-mapped I/O addresses are not used by the kernel directly. There is a special ioremap function which allows us to convert a physical address on a bus to a kernel virtual address, or in other words ioremap maps an I/O physical memory region to make it accessible from the kernel. The ioremap function takes two parameters:
start of the memory region;
size of the memory region;
The I/O memory mapping API provides functions for checking, requesting and releasing a memory region, as the I/O ports API does. There are three functions for it:
request_mem_region
release_mem_region
check_mem_region
~$ sudo cat /proc/iomem
...
...
...
be826000-be82cfff : ACPI Non-volatile Storage
be82d000-bf744fff : System RAM
bf745000-bfff4fff : reserved
bfff5000-dc041fff : System RAM
dc042000-dc0d2fff : reserved
dc0d3000-dc138fff : System RAM
dc139000-dc27dfff : ACPI Non-volatile Storage
dc27e000-deffefff : reserved
defff000-deffffff : System RAM
df000000-dfffffff : RAM buffer
e0000000-feafffff : PCI Bus 0000:00
  e0000000-efffffff : PCI Bus 0000:01
    e0000000-efffffff : 0000:01:00.0
  f7c00000-f7cfffff : PCI Bus 0000:06
    f7c00000-f7c0ffff : 0000:06:00.0
    f7c10000-f7c101ff : 0000:06:00.0
      f7c10000-f7c101ff : ahci
  f7d00000-f7dfffff : PCI Bus 0000:03
    f7d00000-f7d3ffff : 0000:03:00.0
      f7d00000-f7d3ffff : alx
...
...
...
Part of these addresses comes from the call of the e820_reserve_resources function. We can find the call of this function in arch/x86/kernel/setup.c and the function itself is defined in arch/x86/kernel/e820.c. e820_reserve_resources goes through the e820 map and inserts memory regions into the root iomem resource structure. All e820 memory regions which are inserted into the iomem resource have the following types:
static inline const char *e820_type_to_string(int e820_type)
{
    switch (e820_type) {
    case E820_RESERVED_KERN:
    case E820_RAM:      return "System RAM";
    case E820_ACPI:     return "ACPI Tables";
    case E820_NVS:      return "ACPI Non-volatile Storage";
    case E820_UNUSABLE: return "Unusable memory";
    default:            return "reserved";
    }
}
and we can see them in /proc/iomem (read above).
Now let's try to understand how ioremap works. We already know a little about ioremap, we saw it in the fifth part about linux kernel initialization. If you have read this part, you can remember the call of the early_ioremap_init function from arch/x86/mm/ioremap.c. Initialization of the ioremap is split into two parts: there is the early part which we can use before the normal ioremap is available, and the normal ioremap which is available after vmalloc initialization and the call of paging_init. We do not know anything about vmalloc for now, so let's consider the early initialization of the ioremap. First of all early_ioremap_init checks that fixmap is aligned on a page middle directory boundary:
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
You can read more about BUILD_BUG_ON in the first part about Linux Kernel initialization. So the BUILD_BUG_ON macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the early_ioremap_setup function from mm/early_ioremap.c. This function presents the generic initialization of the ioremap. The early_ioremap_setup function fills the slot_virt array with the virtual addresses of the early fixmaps. All early fixmaps are after __end_of_permanent_fixed_addresses in memory. They start from FIX_BTMAP_BEGIN (top) and end with FIX_BTMAP_END (down). Actually there are 512 temporary boot-time mappings, used by early ioremap:
#define NR_FIX_BTMAPS 64
#define FIX_BTMAPS_SLOTS 8
#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
and early_ioremap_setup:
void __init early_ioremap_setup(void)
{
    int i;
    for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        if (WARN_ON(prev_map[i]))
            break;
    for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
}
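The layout of the eight slots follows directly from the formula in the second loop. Here is a userspace sketch of it; the concrete FIXADDR_TOP and FIX_BTMAP_END values are assumptions made only for this demo (the real ones depend on the architecture and the kernel configuration), and setup_slots is a made-up name:

```c
#include <stdint.h>

#define PAGE_SHIFT       12
#define NR_FIX_BTMAPS    64
#define FIX_BTMAPS_SLOTS 8
#define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)

/* Assumptions for this demo only: real values depend on the kernel config. */
#define FIXADDR_TOP      UINT64_C(0xffffffffff7ff000)
#define FIX_BTMAP_END    1024u   /* hypothetical enum index */
#define FIX_BTMAP_BEGIN  (FIX_BTMAP_END + TOTAL_FIX_BTMAPS - 1)

static uint64_t slot_virt[FIX_BTMAPS_SLOTS];

static uint64_t fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((uint64_t)idx << PAGE_SHIFT);
}

/* Fill slot_virt with the same formula that early_ioremap_setup uses. */
static void setup_slots(void)
{
    for (int i = 0; i < FIX_BTMAPS_SLOTS; i++)
        slot_virt[i] = fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS * i);
}
```

Whatever the base index is, two neighbouring slots are always NR_FIX_BTMAPS pages (256 KiB with 4 KiB pages) apart, so each slot can map up to 64 pages without touching the next one.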
The slot_virt and other arrays are defined in the same source code file:
static void __iomem *prev_map[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long prev_size[FIX_BTMAPS_SLOTS] __initdata;
static unsigned long slot_virt[FIX_BTMAPS_SLOTS] __initdata;
slot_virt contains the virtual addresses of the fix-mapped areas, and the prev_map array contains the addresses of the early ioremap areas. Note that I wrote above: Actually there are 512 temporary boot-time mappings, used by early ioremap, and you can see that all arrays are defined with the __initdata attribute, which means that this memory will be released after the kernel initialization process. After early_ioremap_setup has finished its work, we get the page middle directory where early ioremap begins with the early_ioremap_pmd function, which just gets the base address of the page global directory and calculates the page middle directory for the given address:
static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
    pgd_t *base = __va(read_cr3());
    pgd_t *pgd = &base[pgd_index(addr)];
    pud_t *pud = pud_offset(pgd, addr);
    pmd_t *pmd = pmd_offset(pud, addr);
    return pmd;
}
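The walk from the page global directory down to the page middle directory uses different bit fields of the virtual address as table indexes. A userspace sketch of this index arithmetic, assuming x86_64 4-level paging with 4 KiB pages and 512 entries per table (tbl_index and PTRS_PER_TBL are names chosen for this illustration):

```c
#include <stdint.h>

/* Assumption: x86_64 4-level paging, 4 KiB pages, 512 entries per table. */
#define PAGE_SHIFT   12
#define PMD_SHIFT    21
#define PUD_SHIFT    30
#define PGDIR_SHIFT  39
#define PTRS_PER_TBL 512u

/* Extract the 9-bit table index that starts at bit position shift. */
static unsigned int tbl_index(uint64_t addr, unsigned int shift)
{
    return (unsigned int)((addr >> shift) & (PTRS_PER_TBL - 1));
}

static unsigned int pgd_index(uint64_t addr) { return tbl_index(addr, PGDIR_SHIFT); }
static unsigned int pud_index(uint64_t addr) { return tbl_index(addr, PUD_SHIFT); }
static unsigned int pmd_index(uint64_t addr) { return tbl_index(addr, PMD_SHIFT); }
static unsigned int pte_index(uint64_t addr) { return tbl_index(addr, PAGE_SHIFT); }
```

The kernel helpers additionally dereference the table entries to reach the next level, while this sketch only shows which entry of each table is selected for a given address.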
After this we fill bm_pte (early ioremap page table entries) with zeros and call the pmd_populate_kernel function:
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
pmd_populate_kernel takes three parameters:
init_mm - memory descriptor of the init process (you can read about it in the previous part);
pmd - page middle directory of the beginning of the ioremap fixmaps;
bm_pte - early ioremap page table entries array which is defined as:
static pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)] __page_aligned_bss;
The pmd_populate_kernel function is defined in arch/x86/include/asm/pgalloc.h and populates the given page middle directory (pmd) with the given page table entries (bm_pte):
static inline void pmd_populate_kernel(struct mm_struct *mm,
                                       pmd_t *pmd, pte_t *pte)
{
    paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
    set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
}
where set_pmd is:
#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd)
and native_set_pmd is:
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
    *pmdp = pmd;
}
That's all. Early ioremap is ready to use. There are a couple of checks in the early_ioremap_init function, but they are not so important; anyway the initialization of the ioremap is finished.
Use of early ioremap
As early ioremap is set up, we can use it. It provides two functions:
early_ioremap
early_iounmap
for mapping/unmapping of an I/O physical address to a virtual address. Both functions depend on the CONFIG_MMU configuration option. The memory management unit is a special block of memory management. The main purpose of this block is the translation of virtual addresses to physical ones. Technically the memory management unit knows about the high-level page table address (pgd) from the cr3 control register. If the CONFIG_MMU option is set to n, early_ioremap just returns the given physical address and early_iounmap does nothing. Otherwise, if the CONFIG_MMU option is set to y, early_ioremap calls __early_ioremap which takes three parameters:
phys_addr - base physical address of the I/O memory region to map on virtual addresses;
size - size of the I/O memory region;
prot - page table entry bits.
First of all in __early_ioremap, we go through all the early ioremap fixmap slots, look for the first free one in the prev_map array, remember its number in the slot variable and set up the size once we have found it:
slot = -1;
for (i = 0; i < FIX_BTMAPS_SLOTS; i++) {
    if (!prev_map[i]) {
        slot = i;
        break;
    }
}
...
...
...
prev_size[slot] = size;
last_addr = phys_addr + size - 1;
In the next step we can see the following code:
offset = phys_addr & ~PAGE_MASK;
phys_addr &= PAGE_MASK;
size = PAGE_ALIGN(last_addr + 1) - phys_addr;
Here we are using PAGE_MASK for clearing all bits in the phys_addr besides the first 12 bits. The PAGE_MASK macro is defined as:
#define PAGE_MASK (~(PAGE_SIZE-1))
We know that the size of a page is 4096 bytes or 1000000000000 in binary. PAGE_SIZE - 1 will be 111111111111, but with ~ these low 12 bits become 000000000000 (and all higher bits become ones), and as we use ~PAGE_MASK we get 111111111111 again, so offset keeps only the first 12 bits of phys_addr. On the second line we do the opposite and clear the first 12 bits, and on the third line we get the page-aligned size of the area. We now have an aligned area and need to get the number of pages which are occupied by the new ioremap area and to calculate the fix-mapped index from fixed_addresses in the next steps:
nrpages = size >> PAGE_SHIFT;
idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot;
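The three masking lines and the page count are easy to verify with concrete numbers. The following userspace sketch assumes 4 KiB pages; struct mapping and plan_mapping are names made up for this illustration:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

struct mapping {
    uint64_t offset;    /* offset of the region inside its first page */
    uint64_t phys;      /* page-aligned physical start */
    uint64_t size;      /* page-aligned size in bytes */
    uint64_t nrpages;   /* number of pages to map */
};

/* Same arithmetic as in __early_ioremap, collected into one result. */
static struct mapping plan_mapping(uint64_t phys_addr, uint64_t size)
{
    struct mapping m;
    uint64_t last_addr = phys_addr + size - 1;

    m.offset  = phys_addr & ~PAGE_MASK;
    m.phys    = phys_addr & PAGE_MASK;
    m.size    = PAGE_ALIGN(last_addr + 1) - m.phys;
    m.nrpages = m.size >> PAGE_SHIFT;
    return m;
}
```

For example, a 0x200 byte region which starts at 0xfed00f00 crosses a page boundary: the offset is 0xf00, the aligned start is 0xfed00000, and two pages have to be mapped even though the region itself is much smaller than one page.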
Now we can fill the fix-mapped area with the given physical addresses. On every iteration of the loop, we call the __early_set_fixmap function from arch/x86/mm/ioremap.c, increase the given physical address by the page size, which is 4096 bytes, and update the fixmap index and the number of pages:
while (nrpages > 0) {
    __early_set_fixmap(idx, phys_addr, prot);
    phys_addr += PAGE_SIZE;
    --idx;
    --nrpages;
}
The __early_set_fixmap function gets the page table entry (stored in bm_pte, see above) for the virtual address of the given fixmap index with:
pte = early_ioremap_pte(addr);
In the next step, after the call of early_ioremap_pte, we check the given page flags with the pgprot_val macro and call set_pte or pte_clear depending on it:
if (pgprot_val(flags))
    set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
    pte_clear(&init_mm, addr, pte);
As you can see above, we passed FIXMAP_PAGE_IO as flags to __early_ioremap. FIXMAP_PAGE_IO expands to the:
(__PAGE_KERNEL_EXEC | _PAGE_NX)
flags, so we call the set_pte function for setting the page table entry, which works in the same manner as set_pmd but for PTEs (read above about it). As we have set all PTEs in the loop, we can see the call of the __flush_tlb_one function:
__flush_tlb_one(addr);
This function is defined in arch/x86/include/asm/tlbflush.h and calls __flush_tlb_single or __flush_tlb depending on the value of cpu_has_invlpg:
static inline void __flush_tlb_one(unsigned long addr)
{
    if (cpu_has_invlpg)
        __flush_tlb_single(addr);
    else
        __flush_tlb();
}
The __flush_tlb_one function invalidates the given address in the TLB. As you just saw, we updated the paging structure, but the TLB is not informed of the changes, that's why we need to do it manually. There are two ways to do it. The first is to update the cr3 control register, and the __flush_tlb function does this:
native_write_cr3(native_read_cr3());
The second method is to use the invlpg instruction, which invalidates a TLB entry. Let's look at the __flush_tlb_one implementation. As you can see, first of all it checks cpu_has_invlpg, which is defined as:
#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1
#else
# define cpu_has_invlpg (boot_cpu_data.x86 > 3)
#endif
If a CPU supports the invlpg instruction, we call the __flush_tlb_single macro which expands to the call of __native_flush_tlb_single:
static inline void __native_flush_tlb_single(unsigned long addr)
{
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
or call __flush_tlb which just updates the cr3 register as we saw above. After this step the execution of the __early_set_fixmap function is finished and we can go back to the __early_ioremap implementation. As we have set up the fixmap area for the given address, we need to save the base virtual address of the I/O re-mapped area in prev_map at the slot index:
prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
and return it.
The second function, early_iounmap, unmaps an I/O memory region. This function takes two parameters: the base address and the size of an I/O region, and generally looks very similar to early_ioremap. It also goes through the fixmap slots and looks for the slot with the given address. After this it gets the index of the fixmap slot and calls __late_clear_fixmap or __early_set_fixmap depending on the after_paging_init value. It calls __early_set_fixmap with one difference from how early_ioremap does it: it passes zero as the physical address. And at the end it sets the address of the I/O memory region to NULL:
prev_map[slot] = NULL;
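The slot bookkeeping shared by early_ioremap and early_iounmap can be modelled without any page tables. The sketch below only tracks which slot holds which address: model_ioremap and model_iounmap are hypothetical names, addresses are plain integers, and the page table updates and TLB flushes of the real functions are left out:

```c
#include <stddef.h>
#include <stdint.h>

#define FIX_BTMAPS_SLOTS 8

static uintptr_t prev_map[FIX_BTMAPS_SLOTS];    /* 0 means the slot is free */
static uintptr_t slot_virt[FIX_BTMAPS_SLOTS];   /* filled in by the caller */

/* Claim the first free slot; returns the mapped address or 0 if all are busy. */
static uintptr_t model_ioremap(uintptr_t offset)
{
    for (int i = 0; i < FIX_BTMAPS_SLOTS; i++) {
        if (!prev_map[i]) {
            prev_map[i] = offset + slot_virt[i];
            return prev_map[i];
        }
    }
    return 0;
}

/* Release the slot that holds addr; returns the slot number or -1. */
static int model_iounmap(uintptr_t addr)
{
    if (!addr)
        return -1;
    for (int i = 0; i < FIX_BTMAPS_SLOTS; i++) {
        if (prev_map[i] == addr) {
            prev_map[i] = 0;
            return i;
        }
    }
    return -1;
}
```

A slot released by model_iounmap is the first candidate for the next mapping, which mirrors how the early ioremap slots are reused during boot.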
That's all about fixmaps and ioremap. Of course this part does not cover all features of ioremap; it covered only early ioremap, but there is also the normal ioremap. We need to know more things before we can look at it.
So, this is the end!
Conclusion
This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to linux-internals.
Links
apic
vsyscall
Intel Trusted Execution Technology
Xen
Real Time Clock
e820
Memory management unit
TLB
Paging
Linux kernel memory management Part 1.
Linux kernel concepts
This chapter describes various concepts which are used in the Linux kernel.
Per-CPU variables
CPU masks
Per-CPU variables
Per-CPU variables are one of the kernel features. You can understand what this feature means by reading its name. We can create a variable and each processor core will have its own copy of this variable. In this part, we take a closer look at this feature and try to understand how it is implemented and how it works.
The kernel provides an API for creating per-cpu variables - the DEFINE_PER_CPU macro:
#define DEFINE_PER_CPU(type, name) \
    DEFINE_PER_CPU_SECTION(type, name, "")
This macro is defined in include/linux/percpu-defs.h, as are many other macros for working with per-cpu variables. Now we will see how this feature is implemented.
Take a look at the DEFINE_PER_CPU definition. We see that it takes 2 parameters: type and name, so we can use it to create per-cpu variables, for example like this:
DEFINE_PER_CPU(int, per_cpu_n)
We pass the type and the name of our variable. DEFINE_PER_CPU calls the DEFINE_PER_CPU_SECTION macro and passes the same two parameters and an empty string to it. Let's look at the definition of DEFINE_PER_CPU_SECTION:
#define DEFINE_PER_CPU_SECTION(type, name, sec) \
    __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
    __typeof__(type) name
#define __PCPU_ATTRS(sec) \
    __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
    PER_CPU_ATTRIBUTES
where section is:
#define PER_CPU_BASE_SECTION ".data..percpu"
After all macros are expanded we will get a global per-cpu variable:
__attribute__((section(".data..percpu"))) int per_cpu_n
It means that we will have a per_cpu_n variable in the .data..percpu section. We can find this section in the vmlinux:
.data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
              CONTENTS, ALLOC, LOAD, DATA
Ok, now we know that when we use the DEFINE_PER_CPU macro, a per-cpu variable in the .data..percpu section will be created. When the kernel initializes, it calls the setup_per_cpu_areas function which loads the .data..percpu section multiple times, one section per CPU.
Let's look at the per-CPU areas initialization process. It starts in init/main.c with the call of the setup_per_cpu_areas function, which is defined in arch/x86/kernel/setup_percpu.c.
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
    NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
The setup_per_cpu_areas function starts by printing the maximum number of CPUs set during kernel configuration with the CONFIG_NR_CPUS configuration option, the actual number of CPUs, nr_cpumask_bits (which is the same as the NR_CPUS bit for the new cpumask operators) and the number of NUMA nodes.
We can see this output in the dmesg:
$ dmesg | grep percpu
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
In the next step we check the percpu first chunk allocator. All percpu areas are allocated in chunks. The first chunk is used for the static percpu variables. The Linux kernel has the percpu_alloc command line parameter which selects the type of the first chunk allocator. We can read about it in the kernel documentation:
percpu_alloc=   Select which percpu first chunk allocator to use.
        Currently supported values are "embed" and "page".
        Archs may support subset or none of the selections.
        See comments in mm/percpu.c for details on each
        allocator. This parameter is primarily for debugging
        and performance comparison.
The mm/percpu.c contains the handler of this command line option:
early_param("percpu_alloc", percpu_alloc_setup);
Where the percpu_alloc_setup function sets the pcpu_chosen_fc variable depending on the percpu_alloc parameter value. By default the first chunk allocator is auto:
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
If the percpu_alloc parameter is not given on the kernel command line, the embed allocator will be used, which embeds the first percpu chunk into bootmem with the memblock. The last allocator is the first chunk page allocator, which maps the first chunk with PAGE_SIZE pages.
As I wrote above, first of all we check the first chunk allocator type in setup_per_cpu_areas. We check that the first chunk allocator is not page:
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    ...
    ...
    ...
}
If it is not PCPU_FC_PAGE, we will use the embed allocator and allocate space for the first chunk with the pcpu_embed_first_chunk function:
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
                dyn_size, atom_size,
                pcpu_cpu_distance,
                pcpu_fc_alloc, pcpu_fc_free);
As I wrote above, the pcpu_embed_first_chunk function embeds the first percpu chunk into bootmem. As you can see we pass a couple of parameters to pcpu_embed_first_chunk. They are:
PERCPU_FIRST_CHUNK_RESERVE - the size of the reserved space for the static percpu variables;
dyn_size - minimum free size for dynamic allocation in bytes;
atom_size - all allocations are whole multiples of this and aligned to this parameter;
pcpu_cpu_distance - callback to determine distance between cpus;
pcpu_fc_alloc - function to allocate a percpu page;
pcpu_fc_free - function to release a percpu page.
We calculate all of these parameters before the call of pcpu_embed_first_chunk:
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
    atom_size = PMD_SIZE;
#else
    atom_size = PAGE_SIZE;
#endif
If the first chunk allocator is PCPU_FC_PAGE, we will use pcpu_page_first_chunk instead of pcpu_embed_first_chunk. After the percpu areas are up, we set up the percpu offset and its segment for every CPU with the setup_percpu_segment function (only for x86 systems) and move some early data from the arrays to the percpu variables (x86_cpu_to_apicid, irq_stack_ptr and so on). After the kernel finishes the initialization process, we will have loaded N .data..percpu sections, where N is the number of CPUs, and the section used by the bootstrap processor will contain an uninitialized variable created with the DEFINE_PER_CPU macro.
The kernel provides an API for manipulating per-cpu variables:
get_cpu_var(var)
put_cpu_var(var)
Let's look at the get_cpu_var implementation:
#define get_cpu_var(var) \
(*({ \
    preempt_disable(); \
    this_cpu_ptr(&var); \
}))
The Linux kernel is preemptible, and accessing a per-cpu variable requires us to know which processor the kernel is running on. So the current code must not be preempted and moved to another CPU while accessing a per-cpu variable. That's why first of all we can see a call of the preempt_disable function. After this we can see a call of the this_cpu_ptr macro, which looks like:
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
and
#define raw_cpu_ptr(ptr) per_cpu_ptr(ptr, 0)
where per_cpu_ptr returns a pointer to the per-cpu variable for the given cpu (second parameter). After we've created a per-cpu variable and made modifications to it, we must call the put_cpu_var macro which enables preemption with a call of the preempt_enable function. So the typical usage of a per-cpu variable is as follows:
get_cpu_var(var);
...
//Do something with the 'var'
...
put_cpu_var(var);
Let's look at the per_cpu_ptr macro:
#define per_cpu_ptr(ptr, cpu) \
({ \
    __verify_pcpu_ptr(ptr); \
    SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
})
As I wrote above, this macro returns a per-cpu variable for the given cpu. First of all it calls __verify_pcpu_ptr:
#define __verify_pcpu_ptr(ptr) \
do { \
    const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
    (void)__vpp_verify; \
} while (0)
which checks, without evaluating ptr, that the given ptr can be treated as a const void __percpu *, that is, that it really is a per-cpu pointer.
After this we can see the call of the SHIFT_PERCPU_PTR macro with two parameters. As the first parameter we pass our ptr, and as the second we pass the result of the per_cpu_offset macro called with the cpu number:
#define per_cpu_offset(x) (__per_cpu_offset[x])
which expands to getting the x-th element from the __per_cpu_offset array:
extern unsigned long __per_cpu_offset[NR_CPUS];
where NR_CPUS is the number of CPUs. The __per_cpu_offset array is filled with the distances between cpu-variable copies. For example, if all per-cpu data is X bytes in size, then accessing __per_cpu_offset[Y] means that X*Y will be accessed. Let's look at the SHIFT_PERCPU_PTR implementation:
#define SHIFT_PERCPU_PTR(__p, __offset) \
    RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
RELOC_HIDE just evaluates to (typeof(ptr)) (__ptr + (off)), that is, the pointer shifted by the offset, so it returns a pointer to the variable.
That's all! Of course it is not the full API, but a general overview. It can be hard to start with, but to understand per-cpu variables you mainly need to understand the include/linux/percpu-defs.h magic.
Let's again look at the algorithm of getting a pointer to a per-cpu variable:
The kernel creates multiple .data..percpu sections (one per CPU) during the initialization process;
All variables created with the DEFINE_PER_CPU macro will be relocated to the first section, the one for CPU0;
The __per_cpu_offset array is filled with the distance (BOOT_PERCPU_OFFSET) between .data..percpu sections;
When per_cpu_ptr is called, for example to get a pointer to a certain per-cpu variable for the third CPU, the __per_cpu_offset array will be accessed, where every index points to the required CPU.
That's all.
CPU masks
Introduction
Cpumasks is a special way provided by the Linux kernel to store information about CPUs in the system. The relevant source code and header files which contain the API for cpumask manipulation are:
include/linux/cpumask.h
lib/cpumask.c
kernel/cpu.c
As the comment in include/linux/cpumask.h says: Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the boot_cpu_init function from the Kernel entry point part. This function makes the first boot cpu online, active and so on:
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
cpu_possible is the set of cpu IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option, and if this option is disabled then possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter. If it is true, it calls cpumask_set_cpu, otherwise it calls cpumask_clear_cpu.
There are two ways to create a cpumask. The first is to use cpumask_t. It is defined as:
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
It wraps the cpumask structure which contains one bitmap field, bits. The DECLARE_BITMAP macro gets two parameters:
bitmap name;
number of bits.
and creates an array of unsigned long with the given name. Its implementation is pretty easy:
#define DECLARE_BITMAP(name, bits) \
    unsigned long name[BITS_TO_LONGS(bits)]
where BITS_TO_LONGS:
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
As we are focusing on the x86_64 architecture, unsigned long is 8 bytes in size, so one array element holds 64 bits. With NR_CPUS equal to 8 our array will contain only one element:
(((8) + (64) - 1) / (64)) = 1
The NR_CPUS macro represents the maximum number of CPUs in the system and depends on the CONFIG_NR_CPUS macro. It is defined in include/linux/threads.h and looks like this:
#ifndef CONFIG_NR_CPUS
#define CONFIG_NR_CPUS 1
#endif
#define NR_CPUS CONFIG_NR_CPUS
The second way to define a cpumask is to use the DECLARE_BITMAP macro directly together with the to_cpumask macro, which converts the given bitmap to struct cpumask *:
#define to_cpumask(bitmap) \
    ((struct cpumask *)(1 ? (bitmap) \
        : (void *)sizeof(__check_is_bitmap(bitmap))))
We can see the ternary operator here, whose condition is true every time. The __check_is_bitmap inline function is defined as:
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
    return 1;
}
And returns 1 every time. We need it here for only one purpose: at compile time it checks that a given bitmap is a bitmap, or in other words it checks that a given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro for converting an array of unsigned long to the struct cpumask *.
cpumask API
As we can define a cpumask with one of these methods, the Linux kernel provides an API for manipulating a cpumask. Let's consider one of the functions presented above, for example set_cpu_online. This function takes two parameters:
Number of CPU;
CPU status;
The implementation of this function looks like this:
void set_cpu_online(unsigned int cpu, bool online)
{
    if (online) {
        cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
        cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
    } else {
        cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
    }
}
First of all it checks the second parameter, the state, and calls cpumask_set_cpu or cpumask_clear_cpu depending on it. Here we can see the cast of the second parameter of cpumask_set_cpu to struct cpumask *. In our case it is cpu_online_bits, which is a bitmap and is defined as:
static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
The cpumask_set_cpu function makes only one call to the set_bit function:
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
    set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}
The set_bit function takes two parameters too, and sets a given bit (first parameter) in the memory (second parameter, or the cpu_online_bits bitmap). We can see here that before set_bit is called, its two parameters are passed through:
cpumask_check;
cpumask_bits.
Let's consider these two macros. First of all, cpumask_check does nothing in our case and just returns the given parameter. The second, cpumask_bits, just returns the bits field from the given struct cpumask * structure:
#define cpumask_bits(maskp) ((maskp)->bits)
Now let's look at the set_bit implementation:
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
    if (IS_IMMEDIATE(nr)) {
        asm volatile(LOCK_PREFIX "orb %1,%0"
            : CONST_MASK_ADDR(nr, addr)
            : "iq" ((u8)CONST_MASK(nr))
            : "memory");
    } else {
        asm volatile(LOCK_PREFIX "bts %1,%0"
            : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
    }
}
This function looks scary, but it is not as hard as it seems. First of all it passes nr, the number of the bit, to the IS_IMMEDIATE macro which just calls the GCC builtin __builtin_constant_p function:
#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
__builtin_constant_p checks whether the given parameter is a known constant at compile time. As our cpu is not a compile-time constant, the else clause will be executed:
asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
Let's try to understand how it works step by step:
LOCK_PREFIX is the x86 lock prefix. It tells the cpu to occupy the system bus while the instruction is executed. This allows the CPU to synchronize memory access, preventing simultaneous access of multiple processors (or devices, the DMA controller for example) to one memory cell.
BITOP_ADDR casts the given parameter to (*(volatile long *)) and adds the +m constraint. + means that this operand is both read and written by the instruction. m shows that this is a memory operand. BITOP_ADDR is defined as:
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
Next is the memory clobber. It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).
Ir - the operand may be either an immediate integer constant (I) or a general-purpose register (r).
The bts instruction sets a given bit in a bit string and stores the previous value of that bit in the CF flag. So we passed the cpu number, which is zero in our case, and after set_bit is executed, it sets the zero bit in the cpu_online_bits cpumask. It means that the first cpu is online at this moment.
Additional cpumask API
Besides the set_cpu_* API, cpumask of course provides other API for cpumask manipulation. Let's consider it in short.
cpumask provides a set of macros for getting the number of CPUs in various states. For example:
#define num_online_cpus() cpumask_weight(cpu_online_mask)
This macro returns the number of online CPUs. It calls the cpumask_weight function with the cpu_online_mask bitmap (read about it above). The cpumask_weight function makes one call of the bitmap_weight function with two parameters:
cpumask bitmap;
nr_cpumask_bits - which is NR_CPUS in our case.
static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
    return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}
and calculates the number of set bits in the given bitmap. Besides num_online_cpus, cpumask provides macros for all the CPU states:
num_possible_cpus;
num_active_cpus;
cpu_online;
cpu_possible.
and many more.
Besides that the Linux kernel provides the following API for the manipulation of cpumask:
for_each_cpu - iterates over every cpu in a mask;
for_each_cpu_not - iterates over every cpu in a complemented mask;
cpumask_clear_cpu - clears a cpu in a cpumask;
cpumask_test_cpu - tests a cpu in a mask;
cpumask_setall - sets all cpus in a mask;
cpumask_size - returns the size to allocate for a 'struct cpumask' in bytes;
and many many more...
Links
cpumask documentation
Data Structures in the Linux Kernel
The Linux kernel provides different implementations of data structures like doubly linked list, B+ tree, priority heap and many many more.
This part considers the following data structures and algorithms:
Doubly linked list
Radix tree
Doubly linked list
The Linux kernel provides its own implementation of the doubly linked list, which you can find in include/linux/list.h. We will start Data Structures in the Linux kernel from the doubly linked list data structure. Why? Because it is very popular in the kernel, just try to search.
First of all, let's look at the main structure in include/linux/types.h:
struct list_head {
    struct list_head *next, *prev;
};
You can note that it is different from many implementations of a doubly linked list which you have seen. For example, the doubly linked list structure from the glib library looks like:
struct GList {
    gpointer data;
    GList *next;
    GList *prev;
};
Usually a linked list structure contains a pointer to the item. The implementation of the linked list in the Linux kernel does not. So the main question is: where does the list store the data? The actual implementation of the linked list in the kernel is an intrusive list. An intrusive linked list does not contain data in its nodes. A node just contains pointers to the next and previous nodes, and list nodes are part of the data that is added to the list. This makes the data structure generic, so it does not care about the entry data type anymore.
For example:
struct nmi_desc {
    spinlock_t lock;
    struct list_head head;
};
Let's look at some examples to understand how list_head is used in the kernel. As I already wrote, there are many, really many different places where lists are used in the kernel. Let's look for an example in the miscellaneous character drivers. The misc character drivers API from drivers/char/misc.c is used for writing small drivers for handling simple hardware or virtual devices. Those drivers share the same major number:
#define MISC_MAJOR 10
but have their own minor number. For example you can see it with:
ls -l /dev | grep 10
crw-------   1 root root     10, 235 Mar 21 12:01 autofs
drwxr-xr-x  10 root root         200 Mar 21 12:01 cpu
crw-------   1 root root     10,  62 Mar 21 12:01 cpu_dma_latency
crw-------   1 root root     10, 203 Mar 21 12:01 cuse
drwxr-xr-x   2 root root         100 Mar 21 12:01 dri
crw-rw-rw-   1 root root     10, 229 Mar 21 12:01 fuse
crw-------   1 root root     10, 228 Mar 21 12:01 hpet
crw-------   1 root root     10, 183 Mar 21 12:01 hwrng
crw-rw----+  1 root kvm      10, 232 Mar 21 12:01 kvm
crw-rw----   1 root disk     10, 237 Mar 21 12:01 loop-control
crw-------   1 root root     10, 227 Mar 21 12:01 mcelog
crw-------   1 root root     10,  59 Mar 21 12:01 memory_bandwidth
crw-------   1 root root     10,  61 Mar 21 12:01 network_latency
crw-------   1 root root     10,  60 Mar 21 12:01 network_throughput
crw-r-----   1 root kmem     10, 144 Mar 21 12:01 nvram
brw-rw----   1 root disk      1,  10 Mar 21 12:01 ram10
crw--w----   1 root tty       4,  10 Mar 21 12:01 tty10
crw-rw----   1 root dialout   4,  74 Mar 21 12:01 ttyS10
crw-------   1 root root     10,  63 Mar 21 12:01 vga_arbiter
crw-------   1 root root     10, 137 Mar 21 12:01 vhci
Now let's have a close look at how lists are used in the misc device drivers. First of all, let's look at the miscdevice structure:
struct miscdevice
{
    int minor;
    const char *name;
    const struct file_operations *fops;
    struct list_head list;
    struct device *parent;
    struct device *this_device;
    const char *nodename;
    mode_t mode;
};
We can see the fourth field in the miscdevice structure, list, which is a list of registered devices. At the beginning of the source code file we can see the definition of misc_list:
static LIST_HEAD(misc_list);
which expands to the definition of a variable with the list_head type:
#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)
and initializes it with the LIST_HEAD_INIT macro, which sets the previous and next entries to the address of the variable name:
#define LIST_HEAD_INIT(name) { &(name), &(name) }
Now let's look at the misc_register function which registers a miscellaneous device. At the start it initializes miscdevice->list with the INIT_LIST_HEAD function:
INIT_LIST_HEAD(&misc->list);
which does the same as the LIST_HEAD_INIT macro:
static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}
In the next step, after a device is created by the device_create function, we add it to the miscellaneous devices list with:
list_add(&misc->list, &misc_list);
The kernel list.h provides this API for the addition of a new entry to the list. Let's look at its implementation:
static inline void list_add(struct list_head *new, struct list_head *head)
{
    __list_add(new, head, head->next);
}
It just calls the internal function __list_add with the 3 given parameters:
new - new entry.
head - list head after which the new item will be inserted.
head->next - next item after the list head.
The implementation of __list_add is pretty simple:
static inline void __list_add(struct list_head *new,
                  struct list_head *prev,
                  struct list_head *next)
{
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}
Here we add a new item between prev and next. So the misc list which we defined at the start with the LIST_HEAD_INIT macro will contain previous and next pointers to the miscdevice->list.
There is still one question: how to get the list's entry. There is a special macro:
#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)
which gets three parameters:
ptr - the structure list_head pointer;
type - structure type;
member - the name of the list_head within the structure;
For example:
const struct miscdevice *p = list_entry(v, struct miscdevice, list);
After this we can access any miscdevice field with p->minor or p->name and so on. Let's look at the list_entry implementation:
#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)
As we can see it just calls the container_of macro with the same arguments. At first sight, container_of looks strange:
#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type,member) );})
First of all you can note that it consists of two expressions in curly brackets. This is the GCC statement expression extension: the compiler will evaluate the whole block in the curly braces and use the value of the last expression.
For example:
#include <stdio.h>
int main() {
    int i = 0;
    printf("i = %d\n", ({++i; ++i;}));
    return 0;
}
will print 2.
The next point is typeof, it's simple. As you can understand from its name, it just returns the type of the given variable. When I first saw the implementation of the container_of macro, the strangest thing I found was the zero in the ((type *)0) expression. Actually this pointer magic calculates the offset of the given field from the address of the structure, but as we have 0 here, it will be just a zero offset along with the field width. Let's look at a simple example:
#include <stdio.h>
struct s {
    int field1;
    char field2;
    char field3;
};
int main() {
    printf("%p\n", (void *)&((struct s*)0)->field3);
    return 0;
}
will print 0x5.
The next macro, offsetof, calculates the offset from the beginning of the structure to the given structure's field. Its implementation is very similar to the previous code:
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
Let's summarize everything about the container_of macro. The container_of macro returns the address of the structure given the address of the structure's field with the list_head type, the name of that field and the type of the container structure. On the first line this macro declares the __mptr pointer, whose type is a pointer to the type of the member field, and assigns ptr to it. Now ptr and __mptr point to the same address. Technically we don't need this line but it's useful for type checking. The first line ensures that the given structure (type parameter) has a member called member. On the second line it calculates the offset of the field from the start of the structure with the offsetof macro and subtracts it from the address of the field, which gives the address of the containing structure. That's all.
Of course list_add and list_entry are not the only functions which <linux/list.h> provides. The implementation of the doubly linked list provides the following API:
list_add
list_add_tail
list_del
list_replace
list_move
list_is_last
list_empty
list_cut_position
list_splice
list_for_each
list_for_each_entry
and many more.
Radix tree
As you already know, the linux kernel provides many different libraries and functions which implement different data structures and algorithms. In this part we will consider one of these data structures, the Radix tree. There are two files which are related to the radix tree implementation and API in the linux kernel:
include/linux/radix-tree.h
lib/radix-tree.c
Let's talk about what a radix tree is. A radix tree is a compressed trie, where a trie is a data structure which implements the interface of an associative array and allows storing values as key-value pairs. The keys are usually strings, but any data type can be used. A trie is different from an n-tree because of its nodes. Nodes of a trie do not store keys; instead, a node of a trie stores single character labels. The key which is related to a given node is derived by traversing from the root of the tree to this node. For example:
               +-----------+
               |           |
               |    " "    |
               |           |
        +------+-----------+------+
        |                         |
        |                         |
   +----v------+            +-----v-----+
   |           |            |           |
   |    g      |            |     c     |
   |           |            |           |
   +-----------+            +-----------+
        |                         |
        |                         |
   +----v------+            +-----v-----+
   |           |            |           |
   |    o      |            |     a     |
   |           |            |           |
   +-----------+            +-----------+
                                  |
                                  |
                            +-----v-----+
                            |           |
                            |     t     |
                            |           |
                            +-----------+
So in this example, we can see the trie with the keys go and cat. The compressed trie, or radix tree, differs from a trie in that all intermediate nodes which have only one child are removed.
The radix tree in the linux kernel is a data structure which maps values to integer keys. It is represented by the following structures from the file include/linux/radix-tree.h:
struct radix_tree_root {
    unsigned int height;
    gfp_t gfp_mask;
    struct radix_tree_node __rcu *rnode;
};
This structure represents the root of a radix tree and contains three fields:
height - height of the tree;
gfp_mask - tells how memory allocations will be performed;
rnode - pointer to the child node.
The first field we will discuss is gfp_mask:
Low-level kernel memory allocation functions take a set of flags, the gfp_mask, which describes how that allocation is to be performed. These GFP_ flags control the allocation process and can have the following values:
GFP_NOIO - can sleep and wait for memory, but will not start any I/O;
__GFP_HIGHMEM - high memory can be used;
GFP_ATOMIC - allocation process is high-priority and can't sleep;
etc.
The next field is rnode:
struct radix_tree_node {
    unsigned int path;
    unsigned int count;
    union {
        struct {
            struct radix_tree_node *parent;
            void *private_data;
        };
        struct rcu_head rcu_head;
    };
    /* For tree user */
    struct list_head private_list;
    void __rcu *slots[RADIX_TREE_MAP_SIZE];
    unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
This structure contains information about the offset in a parent and the height from the bottom, the count of the child nodes and fields for accessing and freeing a node. These fields are described below:
path - offset in parent & height from the bottom;
count - count of the child nodes;
parent - pointer to the parent node;
private_data - used by the user of a tree;
rcu_head - used for freeing a node;
private_list - used by the user of a tree;
The two last fields of the radix_tree_node, tags and slots, are important and interesting. Every node can contain a set of slots which store pointers to the data. Empty slots in the linux kernel radix tree implementation store NULL. Radix trees in the linux kernel also support tags, which are associated with the tags field in the radix_tree_node structure. Tags allow individual bits to be set on records which are stored in the radix tree.
Now that we know about the radix tree structure, it is time to look at its API.
Linux kernel radix tree API
We start from the data structure initialization. There are two ways to initialize a new radix tree. The first is to use the RADIX_TREE macro:
RADIX_TREE(name, gfp_mask);
As you can see we pass the name parameter, so with the RADIX_TREE macro we can define and initialize a radix tree with the given name. The implementation of RADIX_TREE is easy:
#define RADIX_TREE(name, mask) \
    struct radix_tree_root name = RADIX_TREE_INIT(mask)
#define RADIX_TREE_INIT(mask) { \
    .height = 0, \
    .gfp_mask = (mask), \
    .rnode = NULL, \
}
At the beginning of the RADIX_TREE macro we define an instance of the radix_tree_root structure with the given name and call the RADIX_TREE_INIT macro with the given mask. The RADIX_TREE_INIT macro just initializes the radix_tree_root structure with the default values and the given mask.
The second way is to define the radix_tree_root structure by hand and pass a pointer to it, together with the mask, to the INIT_RADIX_TREE macro:
struct radix_tree_root my_radix_tree;
INIT_RADIX_TREE(&my_radix_tree, gfp_mask_for_my_radix_tree);
where:
#define INIT_RADIX_TREE(root, mask) \
do { \
    (root)->height = 0; \
    (root)->gfp_mask = (mask); \
    (root)->rnode = NULL; \
} while (0)
makes the same initialization with default values as the RADIX_TREE_INIT macro does.
Next are two functions for inserting and deleting records to/from a radix tree:
radix_tree_insert;
radix_tree_delete;
The first, the radix_tree_insert function, takes three parameters:
root of a radix tree;
index key;
data to insert;
The radix_tree_delete function takes the same set of parameters as radix_tree_insert, but without the data.
The search in a radix tree is implemented in three ways:
radix_tree_lookup;
radix_tree_gang_lookup;
radix_tree_lookup_slot.
The first, the radix_tree_lookup function, takes two parameters:
root of a radix tree;
index key;
This function tries to find the given key in the tree and returns the record associated with this key. The second, the radix_tree_gang_lookup function, has the following signature
unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
                                    void **results,
                                    unsigned long first_index,
                                    unsigned int max_items);
and returns the number of records, sorted by the keys, starting from the first index. The number of returned records will not be greater than the max_items value.
And the last, the radix_tree_lookup_slot function, will return the slot which contains the data.
Links
Radix tree
Trie
Theory
This chapter describes various theoretical concepts and concepts which are not directly related to practice but useful to know.
Paging
Elf64 format
Paging
Introduction
In the fifth part of the series Linux kernel booting process we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like initrd mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
Yeah, there will be many different things, but many many and once again many of them work with memory.
In my view, memory management is one of the most complex parts of the linux kernel and of system programming in general. This is why before we proceed with the kernel initialization stuff, we need to get acquainted with paging.
Paging is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode, where physical addresses are calculated by shifting a segment register by four and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now that we are in 64-bit mode, we will see paging.
As the Intel manual says:
Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a program's execution environment are mapped into physical memory as needed.
So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the x86_64 version of the linux kernel, but we will not go into too much detail (at least in this post).
Enabling paging
There are three paging modes:
32-bit paging;
PAE paging;
IA-32e paging.
We will only explain the last mode here. To enable the IA-32e paging mode we need to do the following things:
set the CR0.PG bit;
set the CR4.PAE bit;
set the IA32_EFER.LME bit.
We already saw where these bits were set in arch/x86/boot/compressed/head_64.S:
movl $(X86_CR0_PG | X86_CR0_PE), %eax
movl %eax, %cr0
and
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
Paging structures
Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is 4096 bytes for the x86_64 linux kernel. To perform the translation from a linear address to a physical address, special structures are used. Every structure is 4096 bytes in size and contains 512 entries (this is only for PAE and IA32_EFER.LME modes). Paging structures are hierarchical and the linux kernel uses 4 levels of paging in the x86_64 architecture. The CPU uses a part of the linear address to identify the entry in another paging structure which is at the lower level, a physical memory region (page frame) or a physical address in this region (page offset). The address of the top level paging structure is located in the cr3 register. We already saw this in arch/x86/boot/compressed/head_64.S:
leal pgtable(%ebx), %eax
movl %eax, %cr3
We built the page table structures and put the address of the top-level structure in the cr3 register. Here cr3 is used to store the address of the top-level structure, the PML4 or Page Global Directory as it is called in the linux kernel. cr3 is a 64-bit register and has the following structure:
63                  52 51                                                        32
 --------------------------------------------------------------------------------
|                     |                                                          |
|    Reserved MBZ     |            Address of the top level structure            |
|                     |                                                          |
 --------------------------------------------------------------------------------
31                                  12 11            5     4     3 2             0
 --------------------------------------------------------------------------------
|                                     |               |  P  |  P  |              |
|  Address of the top level structure |   Reserved    |  C  |  W  |    Reserved  |
|                                     |               |  D  |  T  |              |
 --------------------------------------------------------------------------------
These fields have the following meanings:
Bits 2:0 - ignored;
Bits 51:12 - store the address of the top level paging structure;
Bits 3 and 4 - PWT or Page-Level Writethrough and PCD or Page-level cache disable. These bits control the way the page or Page Table is handled by the hardware cache;
Reserved - reserved, must be 0;
Bits 63:52 - reserved, must be 0.
The linear address translation is as follows:
A given linear address arrives to the MMU instead of the memory bus.
The 64-bit linear address is split into several parts. Only the low 48 bits are significant, which means that 2^48 or 256 TBytes of linear-address space may be accessed at any given time.
The cr3 register stores the address of the level-4 (top-level) paging structure.
Bits 47:39 of the given linear address store an index into the level-4 paging structure, bits 38:30 store an index into the level-3 paging structure, bits 29:21 store an index into the level-2 paging structure, bits 20:12 store an index into the level-1 paging structure and bits 11:0 provide the byte offset into the physical page.
Schematically, we can imagine it like this:
[Figure: 4-level paging]
Everyaccesstoalinearaddressiseitherasupervisor-modeaccessorauser-modeaccess.ThisaccessisdeterminedbytheCPL(currentprivilegelevel).IfCPL<3itisasupervisormodeaccesslevelotherwise,otherwiseitisausermodeaccesslevel.Forexample,thetoplevelpagetableentrycontainsaccessbitsandhasthefollowingstructure:
63  62                  52 51                                                    32
 --------------------------------------------------------------------------------
 | N |                     |                                                     |
 |   |     Available       |     Address of the paging structure on lower level  |
 | X |                     |                                                     |
 --------------------------------------------------------------------------------
31                                              12 11  9 8 7 6 5   4   3 2 1     0
 --------------------------------------------------------------------------------
 |                                                |     | M |I| | P | P |U|W|    |
 | Address of the paging structure on lower level | AVL | B |G|A| C | W | | |  P |
 |                                                |     | Z |N| | D | T |S|R|    |
 --------------------------------------------------------------------------------
Where:
Bit 63 - the N/X (No Execute) bit, which controls the ability to execute code from the physical pages mapped by the table entry;
Bits 62:52 - ignored by the CPU, used by system software;
Bits 51:12 - store the physical address of the lower-level paging structure;
Bits 11:9 - ignored by the CPU (AVL);
MBZ - must-be-zero bits;
IGN - ignored bits;
A - the accessed bit indicates whether the physical page or paging structure was accessed;
PWT and PCD - used for cache control;
U/S - the user/supervisor bit controls user access to all physical pages mapped by this table entry;
R/W - the read/write bit controls read/write access to all physical pages mapped by this table entry;
P - the present bit indicates whether the page table or physical page is loaded into primary memory.
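As a rough illustration of the entry layout above, the following Python sketch pulls the flag bits and the lower-level address out of an entry. The names `FLAG_BITS` and `decode_entry` and the sample entry value are hypothetical and exist only for this example.

```python
# Bit positions taken from the diagram above.
FLAG_BITS = {"P": 0, "R/W": 1, "U/S": 2, "PWT": 3, "PCD": 4, "A": 5, "NX": 63}
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF   # bits 51:12


def decode_entry(entry: int):
    """Return (address of the lower-level structure, list of set flags)."""
    flags = [name for name, bit in FLAG_BITS.items() if (entry >> bit) & 1]
    return entry & ADDR_MASK, flags


# Hypothetical entry: present, writable, user-accessible, accessed, no-execute.
addr, flags = decode_entry(0x8000000001C0D027)
print(hex(addr), flags)
```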
Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the Linux kernel.
Paging structures in the Linux kernel
As we've seen, the Linux kernel on x86_64 uses 4-level page tables. Their names are:
Page Global Directory
Page Upper Directory
Page Middle Directory
Page Table Entry
After you've compiled and installed the Linux kernel, you can see the System.map file which stores the virtual addresses of the functions that are used by the kernel. For example:
$ grep "start_kernel" System.map
ffffffff81efe497 T x86_64_start_kernel
ffffffff81efeaa2 T start_kernel
We can see 0xffffffff81efe497 here. I doubt you really have that much RAM installed. But anyway, start_kernel and x86_64_start_kernel will be executed. The address space in x86_64 is 2^64 bytes in size, but that is too large, so a smaller address space is used, only 48 bits wide. So we have a situation where the virtual address space is limited to 48 bits, but addressing is still performed with 64-bit pointers. How is this problem solved? Look at this diagram:
0xffffffffffffffff  +-----------+
                    |           |
                    |           | Kernel space
                    |           |
0xffff800000000000  +-----------+
                    |           |
                    |           |
                    |   hole    |
                    |           |
                    |           |
0x00007fffffffffff  +-----------+
                    |           |
                    |           | Userspace
                    |           |
0x0000000000000000  +-----------+
This solution is sign extension. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits 63:48 can be either all zeroes or all ones. Note that the virtual address space is split into 2 parts:
Kernel space
Userspace
Userspace occupies the lower part of the virtual address space, from 0x0000000000000000 to 0x00007fffffffffff, and kernel space occupies the highest part, from 0xffff800000000000 to 0xffffffffffffffff. Note that bits 63:48 are 0 for userspace and 1 for kernel space. All addresses in kernel space and in userspace, in other words all addresses whose bits 63:48 are all zeroes or all ones, are called canonical addresses. There is a non-canonical area between these memory regions. Together these two memory regions (kernel space and userspace) are exactly 2^48 bytes in size. We can find the virtual memory map with 4-level page tables in Documentation/x86/x86_64/mm.txt:
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
hole caused by [48:63] sign extension
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
We can see here the memory map for userspace, kernel space and the non-canonical area in between them. The userspace memory map is simple. Let's take a closer look at the kernel space. We can see that it starts from the guard hole which is reserved for the hypervisor. We can find the definition of this guard hole in arch/x86/include/asm/page_64_types.h:
#define __PAGE_OFFSET           _AC(0xffff880000000000, UL)
Previously this guard hole and __PAGE_OFFSET was from 0xffff800000000000 to 0xffff80ffffffffff to prevent access to the non-canonical area, but it was later extended by 3 bits for the hypervisor.
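The canonical-address rule described earlier (bits 63:48 must all equal bit 47) is easy to check by hand or in code. The sketch below is an illustration only; `is_canonical` and `classify` are hypothetical helper names, and addresses are assumed to be unsigned 64-bit values.

```python
def is_canonical(addr: int) -> bool:
    """True if bits 63:48 are copies of bit 47 (sign extension)."""
    top = addr >> 47              # bits 63:47, 17 bits in total
    return top == 0 or top == 0x1FFFF


def classify(addr: int) -> str:
    """Place an address in userspace, kernel space or the hole."""
    if not is_canonical(addr):
        return "non-canonical"
    return "kernel" if addr >> 63 else "user"


for a in (0x00007FFFFFFFFFFF, 0x0000800000000000,
          0xFFFF800000000000, 0xFFFF880000000000):
    print(hex(a), classify(a))
```

Checking bits 63:47 together covers both conditions at once: the upper 16 bits are uniform and they agree with bit 47.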
Next is the lowest usable address in kernel space - ffff880000000000. This virtual memory region is for the direct mapping of all physical memory. After the memory space which maps all physical addresses comes a guard hole; it needs to sit between the direct mapping of all physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the kasan shadow memory. It was added by a commit and provides the kernel address sanitizer. After the next unused hole we can see the esp fixup stacks (we will talk about them in other parts of this book) and the start of the kernel text mapping from physical address 0. We can find the definition of this address in the same file as __PAGE_OFFSET:
#define __START_KERNEL_map      _AC(0xffffffff80000000, UL)
Usually the kernel's .text starts here with the CONFIG_PHYSICAL_START offset. We saw it in the post about ELF64:
readelf -s vmlinux | grep ffffffff81000000
     1: ffffffff81000000     0 SECTION LOCAL  DEFAULT    1
 65099: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 _text
 90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64
Here I checked vmlinux with CONFIG_PHYSICAL_START set to 0x1000000. So we have the start point of the kernel .text, 0xffffffff80000000, and the offset, 0x1000000; the resulting virtual address is 0xffffffff80000000 + 0x1000000 = 0xffffffff81000000.
After the kernel .text region there is the virtual memory region for kernel modules, vsyscalls and an unused hole of 2 megabytes.
We've seen how the kernel's virtual memory map is laid out and how a virtual address is translated into a physical one. Let's take the following address as an example:
0xffffffff81000000
In binary it will be:
1111111111111111 111111111 111111110 000001000 000000000 000000000000
      63:48        47:39     38:30     29:21     20:12        11:0
This virtual address is split into parts as described above:
63:48 - bits not used;
47:39 - bits store an index into the level-4 paging structure;
38:30 - bits store an index into the level-3 paging structure;
29:21 - bits store an index into the level-2 paging structure;
20:12 - bits store an index into the level-1 paging structure;
11:0 - bits provide the byte offset into the physical page.
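The split listed above can be reproduced with shifts and masks. The following Python sketch is an illustration; the function name and the dictionary keys are chosen for this example and are not kernel identifiers.

```python
def split_linear_address(addr: int) -> dict:
    """Extract the four 9-bit table indexes and the 12-bit page offset."""
    return {
        "level4": (addr >> 39) & 0x1FF,   # bits 47:39
        "level3": (addr >> 30) & 0x1FF,   # bits 38:30
        "level2": (addr >> 21) & 0x1FF,   # bits 29:21
        "level1": (addr >> 12) & 0x1FF,   # bits 20:12
        "offset": addr & 0xFFF,           # bits 11:0
    }


parts = split_linear_address(0xFFFFFFFF81000000)
print(parts)
# {'level4': 511, 'level3': 510, 'level2': 8, 'level1': 0, 'offset': 0}
```

The values match the binary groups shown above: 111111111 is 511, 111111110 is 510 and 000001000 is 8.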
That is all. Now you know a little about the theory of paging and we can go ahead in the kernel source code and see the first initialization steps.
Conclusion
It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the Linux kernel builds paging structures and works with them.
Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me a PR to linux-internals.
Links
Paging on Wikipedia
Intel 64 and IA-32 architectures software developer's manual volume 3A
MMU
ELF64
Documentation/x86/x86_64/mm.txt
Last part - Kernel booting process
Executable and Linkable Format
ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries and core dumps. Linux and many UNIX-like operating systems use this format. Let's look at the structure of the ELF-64 Object File Format and some definitions in the Linux kernel source code which are related to it.
An ELF object file consists of the following parts:
ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset of the remaining parts, etc.;
Program header table - lists the available segments and their attributes. Loaders need the program header table to place sections of the file as virtual memory segments;
Section header table - contains the description of the sections.
Now let's have a closer look at these components.
ELF header
The ELF header is located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The file header contains the following fields:
ELF identification - array of bytes which helps identify the file as an ELF object file and also provides information about general object file characteristics;
Object file type - identifies the object file type. This field can describe that the ELF file is a relocatable object file, an executable file, etc.;
Target architecture;
Version of the object file format;
Virtual address of the program entry point;
File offset of the program header table;
File offset of the section header table;
Size of an ELF header;
Size of a program header table entry;
and other fields...
You can find the elf64_hdr structure which represents the ELF64 header in the Linux kernel source code:
typedef struct elf64_hdr {
    unsigned char e_ident[EI_NIDENT];
    Elf64_Half e_type;
    Elf64_Half e_machine;
    Elf64_Word e_version;
    Elf64_Addr e_entry;
    Elf64_Off e_phoff;
    Elf64_Off e_shoff;
    Elf64_Word e_flags;
    Elf64_Half e_ehsize;
    Elf64_Half e_phentsize;
    Elf64_Half e_phnum;
    Elf64_Half e_shentsize;
    Elf64_Half e_shnum;
    Elf64_Half e_shstrndx;
} Elf64_Ehdr;
This structure is defined in elf.h.
Sections
All data is stored in sections in an ELF object file. Sections are identified by their index in the section header table. A section header contains the following fields:
Section name;
Section type;
Section attributes;
Virtual address in memory;
Offset in file;
Size of section;
Link to other section;
Miscellaneous information;
Address alignment boundary;
Size of entries, if the section has a table.
It is represented by the following elf64_shdr structure in the Linux kernel:
typedef struct elf64_shdr {
    Elf64_Word sh_name;
    Elf64_Word sh_type;
    Elf64_Xword sh_flags;
    Elf64_Addr sh_addr;
    Elf64_Off sh_offset;
    Elf64_Xword sh_size;
    Elf64_Word sh_link;
    Elf64_Word sh_info;
    Elf64_Xword sh_addralign;
    Elf64_Xword sh_entsize;
} Elf64_Shdr;
This structure is also defined in elf.h.
Program header table
All sections are grouped into segments in an executable or shared object file. The program header table is an array of structures, each of which describes a segment. It looks like:
typedef struct elf64_phdr {
    Elf64_Word p_type;
    Elf64_Word p_flags;
    Elf64_Off p_offset;
    Elf64_Addr p_vaddr;
    Elf64_Addr p_paddr;
    Elf64_Xword p_filesz;
    Elf64_Xword p_memsz;
    Elf64_Xword p_align;
} Elf64_Phdr;
in the Linux kernel source code.
elf64_phdr is defined in the same elf.h.
The ELF object file also contains other fields/structures which you can find in the documentation. Now let's take a look at the vmlinux ELF object.
vmlinux
vmlinux is also an ELF object file. We can take a look at it with the readelf util. First of all let's look at the header:
$ readelf -h vmlinux
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1000000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          381608416 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           64 (bytes)
  Number of section headers:         73
  Section header string table index: 70
Here we can see that vmlinux is a 64-bit executable file.
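To connect the readelf output with the elf64_hdr structure, here is a minimal Python sketch that unpacks the same 64-byte header. It assumes a little-endian ELF64 file; the names `EHDR`, `Elf64Ehdr` and `parse_elf64_header` are hypothetical, and the sample header is rebuilt from the values printed above rather than read from a real vmlinux.

```python
import struct
from collections import namedtuple

# 16 ident bytes, then Half (H), Word (I) and Addr/Off (Q) fields
# in the order they are declared in elf64_hdr.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
Elf64Ehdr = namedtuple(
    "Elf64Ehdr",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)


def parse_elf64_header(data: bytes) -> Elf64Ehdr:
    hdr = Elf64Ehdr._make(EHDR.unpack_from(data))
    if hdr.e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if hdr.e_ident[4] != 2:            # EI_CLASS: 2 means ELF64
        raise ValueError("not an ELF64 file")
    return hdr


# Synthetic header built from the readelf -h values shown above.
sample = EHDR.pack(bytes.fromhex("7f454c46020101000000000000000000"),
                   2, 62, 1, 0x1000000, 64, 381608416, 0,
                   64, 56, 5, 64, 73, 70)
hdr = parse_elf64_header(sample)
print(hex(hdr.e_entry), hdr.e_phnum, hdr.e_shnum)
```

To inspect a real file, the first 64 bytes read in binary mode could be passed to `parse_elf64_header` instead of `sample`.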
We can read from Documentation/x86/x86_64/mm.txt:
ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
We can then look this address up in the vmlinux ELF object with:
$ readelf -s vmlinux | grep ffffffff81000000
     1: ffffffff81000000     0 SECTION LOCAL  DEFAULT    1
 65099: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 _text
 90766: ffffffff81000000     0 NOTYPE  GLOBAL DEFAULT    1 startup_64
Note that the address of the startup_64 routine is not ffffffff80000000, but ffffffff81000000, and now I'll explain why.
We can see the following definition in arch/x86/kernel/vmlinux.lds.S:
    . = __START_KERNEL;
    ...
    ...
    ..
    /* Text and read-only data */
    .text :  AT(ADDR(.text) - LOAD_OFFSET) {
        _text = .;
        ...
        ...
        ...
    }
Where __START_KERNEL is:
#define __START_KERNEL		(__START_KERNEL_map + __PHYSICAL_START)
__START_KERNEL_map is the value from the documentation, ffffffff80000000, and __PHYSICAL_START is 0x1000000. That's why the address of startup_64 is ffffffff81000000.
And at last we can get the program headers from vmlinux with the following command:
readelf -l vmlinux
Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000cfd000 0x0000000000cfd000  R E    200000
  LOAD           0x0000000001000000 0xffffffff81e00000 0x0000000001e00000
                 0x0000000000100000 0x0000000000100000  RW     200000
  LOAD           0x0000000001200000 0x0000000000000000 0x0000000001f00000
                 0x0000000000014d98 0x0000000000014d98  RW     200000
  LOAD           0x0000000001315000 0xffffffff81f15000 0x0000000001f15000
                 0x000000000011d000 0x0000000000279000  RWE    200000
  NOTE           0x0000000000b17284 0xffffffff81917284 0x0000000001917284
                 0x0000000000000024 0x0000000000000024         4
 Section to Segment mapping:
  Segment Sections...
   00     .text .notes __ex_table .rodata __bug_table .pci_fixup .builtin_fw
          .tracedata __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl
          __ksymtab_strings __param __modver
   01     .data .vvar
   02     .data..percpu
   03     .init.text .init.data .x86_cpu_dev.init .altinstructions
          .altinstr_replacement .iommu_table .apicdrivers .exit.text
          .smp_locks .data_nosave .bss .brk
Here we can see five segments with their section lists. You can find all of these sections in the generated linker script at arch/x86/kernel/vmlinux.lds.
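The relation between the VirtAddr and PhysAddr columns follows from the AT(ADDR(.text) - LOAD_OFFSET) expression in the linker script. The Python sketch below checks it against the values printed above. It assumes that LOAD_OFFSET equals __START_KERNEL_map on x86_64; that assumption and the Python names are not taken from the text.

```python
START_KERNEL_MAP = 0xFFFFFFFF80000000   # __START_KERNEL_map
PHYSICAL_START = 0x1000000              # __PHYSICAL_START in this build
LOAD_OFFSET = START_KERNEL_MAP          # assumption for x86_64

# (VirtAddr, PhysAddr) pairs copied from the readelf -l output above;
# the per-cpu segment with VirtAddr 0 is left out.
SEGMENTS = [
    (0xFFFFFFFF81000000, 0x0000000001000000),
    (0xFFFFFFFF81E00000, 0x0000000001E00000),
    (0xFFFFFFFF81F15000, 0x0000000001F15000),
    (0xFFFFFFFF81917284, 0x0000000001917284),
]


def load_address(vaddr: int) -> int:
    """Physical load address: ADDR(section) - LOAD_OFFSET."""
    return vaddr - LOAD_OFFSET


start_kernel = START_KERNEL_MAP + PHYSICAL_START
print(hex(start_kernel))
print(all(load_address(v) == p for v, p in SEGMENTS))
```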
That's all. Of course it's not a full description of ELF (Executable and Linkable Format), but if you want to know more, you can find the rest in the ELF documentation.
Misc
This chapter contains parts which are not directly related to the Linux kernel source code and the implementation of different subsystems.
Process of the Linux kernel building
Introduction
I won't tell you how to build and install a custom Linux kernel on your machine. If you need help with this, you can find many resources that will help you do it. Instead, we will learn what occurs when you execute make in the root directory of the Linux kernel source code.
When I started to study the source code of the Linux kernel, the makefile was the first file that I opened. And it was scary :). The makefile contained 1591 lines of code when I wrote this part and the kernel was the 4.2.0-rc3 release.
This makefile is the top makefile in the Linux kernel source code and the kernel building starts here. Yes, it is big, but moreover, if you've read the source code of the Linux kernel you may have noted that all directories containing source code have their own makefiles. Of course it is not possible to describe how each source file is compiled and linked, so we will only study the standard compilation case. You will not find here the building of the kernel's documentation, cleaning of the kernel source code, tags generation, cross-compilation related stuff, etc. We will start from the make execution with the standard kernel configuration file and will finish with the building of the bzImage.
It would be better if you're already familiar with the make util, but I will try to describe every piece of code in this part anyway.
So let's start.
Preparation before the kernel compilation
There are many things to prepare before the kernel compilation can be started. The main point here is to find and configure the type of compilation, to parse command line arguments that are passed to make, etc. So let's dive into the top Makefile of the Linux kernel.
The top Makefile of the Linux kernel is responsible for building two major products: vmlinux (the resident kernel image) and the modules (any module files). The Makefile of the Linux kernel starts with the definition of the following variables:
VERSION = 4
PATCHLEVEL = 2
SUBLEVEL = 0
EXTRAVERSION = -rc3
NAME = Hurr durr I'ma sheep
These variables determine the current version of the Linux kernel and are used in different places, for example in forming the KERNELVERSION variable in the same Makefile:
KERNELVERSION = $(VERSION)$(if $(PATCHLEVEL),.$(PATCHLEVEL)$(if $(SUBLEVEL),.$(SUBLEVEL)))$(EXTRAVERSION)
After this we can see a couple of ifeq conditions that check some of the parameters passed to make. The Linux kernel makefiles provide a special make help target that prints all available targets and some of the command line arguments that can be passed to make. For example: make V=1 => verbose build. The first ifeq checks whether the V=n option is passed to make:
ifeq ("$(origin V)", "command line")
  KBUILD_VERBOSE = $(V)
endif
ifndef KBUILD_VERBOSE
  KBUILD_VERBOSE = 0
endif
ifeq ($(KBUILD_VERBOSE),1)
  quiet =
  Q =
else
  quiet=quiet_
  Q = @
endif
export quiet Q KBUILD_VERBOSE
If this option is passed to make, we set the KBUILD_VERBOSE variable to the value of the V option. Otherwise we set the KBUILD_VERBOSE variable to zero. After this we check the value of the KBUILD_VERBOSE variable and set the values of the quiet and Q variables depending on it. The @ symbol suppresses the output of a command. If it is present before a command, the output will be something like this: CC scripts/mod/empty.o instead of Compiling .... scripts/mod/empty.o. In the end we just export all of these variables. The next ifeq statement checks whether the O=/dir option was passed to make. This option allows us to place all output files in the given dir:
ifeq ($(KBUILD_SRC),)
ifeq ("$(origin O)", "command line")
  KBUILD_OUTPUT := $(O)
endif
ifneq ($(KBUILD_OUTPUT),)
saved-output := $(KBUILD_OUTPUT)
KBUILD_OUTPUT := $(shell mkdir -p $(KBUILD_OUTPUT) && cd $(KBUILD_OUTPUT) \
								&& /bin/pwd)
$(if $(KBUILD_OUTPUT),, \
     $(error failed to create output directory "$(saved-output)"))
sub-make: FORCE
	$(Q)$(MAKE) -C $(KBUILD_OUTPUT) KBUILD_SRC=$(CURDIR) \
	-f $(CURDIR)/Makefile $(filter-out _all sub-make,$(MAKECMDGOALS))
skip-makefile := 1
endif # ifneq ($(KBUILD_OUTPUT),)
endif # ifeq ($(KBUILD_SRC),)
We check KBUILD_SRC, which represents the top directory of the kernel source code, to see whether it is empty (it is empty when the makefile is executed for the first time). We then set the KBUILD_OUTPUT variable to the value passed with the O option (if this option was passed). In the next step we check this KBUILD_OUTPUT variable and, if it is set, we do the following things:
Store the value of KBUILD_OUTPUT in the temporary saved-output variable;
Try to create the given output directory;
Check that the directory was created, otherwise print an error message;
If the custom output directory was created successfully, execute make again with the new directory (see the -C option).
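The first three steps of this list (remember the requested path, create it, fail loudly if that did not work) can be mimicked outside of make. The Python sketch below is only an analogy of the `mkdir -p ... && cd ... && /bin/pwd` idiom; `prepare_output_dir` is a hypothetical name and the sub-make step is not reproduced.

```python
import os


def prepare_output_dir(requested: str) -> str:
    """Create the directory like `mkdir -p` and return its resolved path."""
    saved_output = requested                      # kept for the error message
    try:
        os.makedirs(requested, exist_ok=True)     # mkdir -p
    except OSError as exc:
        raise RuntimeError(
            f'failed to create output directory "{saved_output}"') from exc
    return os.path.realpath(requested)            # cd ... && /bin/pwd


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(prepare_output_dir(os.path.join(tmp, "build", "x86")))
```

Returning the resolved absolute path matters because the sub-make is started from inside the output directory and can no longer rely on a path relative to the source tree.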
The next ifeq statements check whether the C or M options were passed to make:
ifeq ("$(origin C)", "command line")
  KBUILD_CHECKSRC = $(C)
endif
ifndef KBUILD_CHECKSRC
  KBUILD_CHECKSRC = 0
endif
ifeq ("$(origin M)", "command line")
  KBUILD_EXTMOD := $(M)
endif
The C option tells the makefile that we need to check all C source code with the tool provided by the $CHECK environment variable; by default it is sparse. The second option, M, builds external modules (we will not see this case in this part). We also check whether the KBUILD_SRC variable is set, and if it isn't, we set the srctree variable to .:
ifeq ($(KBUILD_SRC),)
        srctree := .
endif
objtree	:= .
src		:= $(srctree)
obj		:= $(objtree)
export srctree objtree VPATH
That tells the Makefile that the kernel source tree will be in the current directory where make was executed. We then set objtree and other variables to this directory and export them. The next step is to get the value for the SUBARCH variable that represents what the underlying architecture is:
SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
				  -e s/sun4u/sparc64/ \
				  -e s/arm.*/arm/ -e s/sa110/arm/ \
				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )
As you can see, it executes the uname util that prints information about the machine, operating system and architecture. It then parses the output of uname and assigns the result to the SUBARCH variable. Now that we have SUBARCH, we set the SRCARCH variable that provides the directory of the certain architecture and hdr-arch that provides the directory for the header files:
ifeq ($(ARCH),i386)
        SRCARCH := x86
endif
ifeq ($(ARCH),x86_64)
        SRCARCH := x86
endif
hdr-arch  := $(SRCARCH)
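The chain of sed expressions used for SUBARCH is applied in order, each one replacing the first match in the current string. A rough Python equivalent, meant as an illustration rather than a replacement for the Makefile logic, could look like this (the names `SED_RULES` and `subarch` are made up for the example):

```python
import re

# Same patterns and order as the sed expressions in the Makefile excerpt.
SED_RULES = [
    (r"i.86", "x86"), (r"x86_64", "x86"),
    (r"sun4u", "sparc64"),
    (r"arm.*", "arm"), (r"sa110", "arm"),
    (r"s390x", "s390"), (r"parisc64", "parisc"),
    (r"ppc.*", "powerpc"), (r"mips.*", "mips"),
    (r"sh[234].*", "sh"), (r"aarch64.*", "arm64"),
]


def subarch(machine: str) -> str:
    """Map a `uname -m` string to the value SUBARCH would get."""
    for pattern, replacement in SED_RULES:
        # sed's s/// without the g flag replaces only the first match.
        machine = re.sub(pattern, replacement, machine, count=1)
    return machine


for name in ("x86_64", "i686", "armv7l", "aarch64", "ppc64le"):
    print(name, "->", subarch(name))
```

Both i686 and x86_64 end up as x86, which is why a single arch/x86 directory serves 32-bit and 64-bit builds.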
Note that ARCH defaults to SUBARCH unless it is set explicitly. In the next step we set the KCONFIG_CONFIG variable that represents the path to the kernel configuration file; if it was not set before, it is set to .config by default:
KCONFIG_CONFIG	?= .config
export KCONFIG_CONFIG
and the shell that will be used during kernel compilation:
CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \
	  else if [ -x /bin/bash ]; then echo /bin/bash; \
	  else echo sh; fi ; fi)
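The shell selection above is a simple fallback chain: use $BASH if it points to an executable, otherwise /bin/bash if it exists, otherwise plain sh. A Python sketch of the same decision follows; the function name and the injectable `is_executable` parameter are hypothetical and exist only to make the logic easy to exercise.

```python
import os


def pick_config_shell(env=None, is_executable=None) -> str:
    """Choose the shell the way the CONFIG_SHELL snippet does."""
    if env is None:
        env = os.environ
    if is_executable is None:
        is_executable = lambda path: os.access(path, os.X_OK)
    bash = env.get("BASH", "")
    if bash and is_executable(bash):      # [ -x "$BASH" ]
        return bash
    if is_executable("/bin/bash"):        # [ -x /bin/bash ]
        return "/bin/bash"
    return "sh"                           # last resort


print(pick_config_shell())
```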
The next set of variables is related to the compilers used during Linux kernel compilation. We set the host compilers for C and C++ and the flags to be used with them:
HOSTCC       = gcc
HOSTCXX      = g++
HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer -std=gnu89
HOSTCXXFLAGS = -O2
Next we get to the CC variable that represents a compiler too, so why do we need the HOST* variables? CC is the target compiler that will be used during kernel compilation, but HOSTCC will be used during compilation of the set of host programs (we will see it soon). After this we can see the definition of the KBUILD_MODULES and KBUILD_BUILTIN variables that are used to determine what to compile (modules, kernel, or both):
KBUILD_MODULES :=
KBUILD_BUILTIN := 1
ifeq ($(MAKECMDGOALS),modules)
  KBUILD_BUILTIN := $(if $(CONFIG_MODVERSIONS),1)
endif
Here we can see the definition of these variables; the value of the KBUILD_BUILTIN variable will depend on the CONFIG_MODVERSIONS kernel configuration parameter if we pass only modules to make. The next step is to include the kbuild file.
include scripts/Kbuild.include
Kbuild, or the Kernel Build System, is the special infrastructure that manages the build of the kernel and its modules. The kbuild files have the same syntax as makefiles. The scripts/Kbuild.include file provides some generic definitions for the kbuild system. After this kbuild file is included, we can see the definition of the variables that are related to the different tools that will be used during kernel and modules compilation (like the linker, compilers, utils from binutils, etc.):
AS		= $(CROSS_COMPILE)as
LD		= $(CROSS_COMPILE)ld
CC		= $(CROSS_COMPILE)gcc
CPP		= $(CC) -E
AR		= $(CROSS_COMPILE)ar
NM		= $(CROSS_COMPILE)nm
STRIP		= $(CROSS_COMPILE)strip
OBJCOPY		= $(CROSS_COMPILE)objcopy
OBJDUMP		= $(CROSS_COMPILE)objdump
AWK		= awk
...
...
...
We then define two other variables: USERINCLUDE and LINUXINCLUDE. They contain the paths of the directories with headers (public for users in the first case and for the kernel in the second case):
USERINCLUDE    := \
		-I$(srctree)/arch/$(hdr-arch)/include/uapi \
		-Iarch/$(hdr-arch)/include/generated/uapi \
		-I$(srctree)/include/uapi \
		-Iinclude/generated/uapi \
		-include $(srctree)/include/linux/kconfig.h
LINUXINCLUDE    := \
		-I$(srctree)/arch/$(hdr-arch)/include \
		...
And the standard flags for the C compiler:
KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
		   -fno-strict-aliasing -fno-common \
		   -Werror-implicit-function-declaration \
		   -Wno-format-security \
		   -std=gnu89
These are not the final compiler flags; they can be updated by the other makefiles (for example kbuilds from arch/). After all of these, all variables will be exported to be available in the other makefiles. The following two variables, RCS_FIND_IGNORE and RCS_TAR_IGNORE, contain the version control system directories that will be ignored by find and tar:
export RCS_FIND_IGNORE := \( -name SCCS -o -name BitKeeper -o -name .svn -o    \
			  -name CVS -o -name .pc -o -name .hg -o -name .git \) \
			  -prune -o
export RCS_TAR_IGNORE := --exclude SCCS --exclude BitKeeper --exclude .svn \
			 --exclude CVS --exclude .pc --exclude .hg --exclude .git
That's all. We have finished with all the preparations; the next point is the building of vmlinux.
Directly to the kernel build
We have now finished all the preparations, and the next step in the main makefile is related to the kernel build. Before this moment, nothing has been printed to the terminal by make. But now the first steps of the compilation are started. We need to go to line 598 of the Linux kernel top makefile and we will find the vmlinux target there:
all: vmlinux
	include arch/$(SRCARCH)/Makefile
Don't worry that we have missed many lines in the Makefile that are between export RCS_FIND_IGNORE..... and all: vmlinux...... This part of the makefile is responsible for the make *.config targets and, as I wrote in the beginning of this part, we will see only the building of the kernel in a general way.
The all: target is the default when no target is given on the command line. You can see here that we include the architecture-specific makefile (in our case it will be arch/x86/Makefile). From this moment we will continue from this makefile. As we can see, the all target depends on the vmlinux target that is defined a little lower in the top makefile:
vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps) FORCE
vmlinux is the Linux kernel in a statically linked executable file format. The scripts/link-vmlinux.sh script links and combines different compiled subsystems into vmlinux. The second prerequisite is vmlinux-deps, which is defined as:
vmlinux-deps := $(KBUILD_LDS) $(KBUILD_VMLINUX_INIT) $(KBUILD_VMLINUX_MAIN)
and consists of the set of built-in.o files from each top directory of the Linux kernel. Later, when we go through all directories in the Linux kernel, Kbuild will compile all the $(obj-y) files. It then calls $(LD) -r to merge these files into one built-in.o file. At this moment we have no vmlinux-deps, so the vmlinux target will not be executed now. For me vmlinux-deps contains the following files:
arch/x86/kernel/vmlinux.lds arch/x86/kernel/head_64.o
arch/x86/kernel/head64.o    arch/x86/kernel/head.o
init/built-in.o             usr/built-in.o
arch/x86/built-in.o         kernel/built-in.o
mm/built-in.o               fs/built-in.o
ipc/built-in.o              security/built-in.o
crypto/built-in.o           block/built-in.o
lib/lib.a                   arch/x86/lib/lib.a
lib/built-in.o              arch/x86/lib/built-in.o
drivers/built-in.o          sound/built-in.o
firmware/built-in.o         arch/x86/pci/built-in.o
arch/x86/power/built-in.o   arch/x86/video/built-in.o
net/built-in.o
The next target that can be executed is the following:
$(sort $(vmlinux-deps)): $(vmlinux-dirs) ;
$(vmlinux-dirs): prepare scripts
	$(Q)$(MAKE) $(build)=$@
As we can see, vmlinux-dirs depends on two targets: prepare and scripts. prepare is defined in the top Makefile of the Linux kernel and executes three stages of preparations:
prepare: prepare0
prepare0: archprepare FORCE
	$(Q)$(MAKE) $(build)=.
archprepare: archheaders archscripts prepare1 scripts_basic
prepare1: prepare2 $(version_h) include/generated/utsrelease.h \
                   include/config/auto.conf
	$(cmd_crmodverdir)
prepare2: prepare3 outputmakefile asm-generic
The first target, prepare0, expands to archprepare, which in turn expands to archheaders and archscripts, both defined in the x86_64-specific Makefile. Let's look at it. The x86_64-specific makefile starts with the definition of the variables that are related to the architecture-specific configs (defconfig, etc.). After this it defines flags for compiling the 16-bit code, calculates the BITS variable that can be 32 for i386 or 64 for x86_64, and sets flags for the assembly source code, flags for the linker and many many more (you can find all definitions in arch/x86/Makefile). The first target in the makefile is archheaders, which generates the syscall table:
archheaders:
	$(Q)$(MAKE) $(build)=arch/x86/entry/syscalls all
And the second target in this makefile is archscripts:
archscripts: scripts_basic
	$(Q)$(MAKE) $(build)=arch/x86/tools relocs
We can see that it depends on the scripts_basic target from the top Makefile. First we can see the scripts_basic target that executes make for the scripts/basic makefile:
scripts_basic:
	$(Q)$(MAKE) $(build)=scripts/basic
The scripts/basic/Makefile contains targets for the compilation of two host programs: fixdep and bin2c:
hostprogs-y	:= fixdep
hostprogs-$(CONFIG_BUILD_BIN2C)     += bin2c
always		:= $(hostprogs-y)
$(addprefix $(obj)/,$(filter-out fixdep,$(always))): $(obj)/fixdep
The first program is fixdep, which optimizes the list of dependencies generated by gcc that tells make when to remake a source code file. The second program is bin2c, which depends on the value of the CONFIG_BUILD_BIN2C kernel configuration option and is a very small C program that converts a binary on stdin to a C include on stdout. You can note here a strange notation: hostprogs-y, etc. This notation is used in all kbuild files and you can read more about it in the documentation. In our case hostprogs-y tells kbuild that there is one host program named fixdep that will be built from fixdep.c, located in the same directory as the Makefile. The first output after we execute make in our terminal will be the result of this kbuild file:
$ make
  HOSTCC  scripts/basic/fixdep
Once the scripts_basic target has been executed, the archscripts target will execute make for the arch/x86/tools makefile with the relocs target:
$(Q)$(MAKE) $(build)=arch/x86/tools relocs
relocs_32.c and relocs_64.c, which contain the relocation information handling, will be compiled, and we will see it in the make output:
  HOSTCC  arch/x86/tools/relocs_32.o
  HOSTCC  arch/x86/tools/relocs_64.o
  HOSTCC  arch/x86/tools/relocs_common.o
  HOSTLD  arch/x86/tools/relocs
After the compilation of relocs.c there is a check of version.h:
$(version_h): $(srctree)/Makefile FORCE
	$(call filechk,version.h)
	$(Q)rm -f $(old_version_h)
We can see it in the output:
CHK     include/config/kernel.release
and the building of the generic assembly headers in arch/x86/include/generated/asm with the asm-generic target, which is defined in the top Makefile of the Linux kernel. After the asm-generic target, archprepare will be done, so the prepare0 target will be executed. As I wrote above:
prepare0: archprepare FORCE
	$(Q)$(MAKE) $(build)=.
A note on build: it is defined in scripts/Kbuild.include and looks like this:
build := -f $(srctree)/scripts/Makefile.build obj
Or, in our case, where the directory is the current source directory (.):
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.build obj=.
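The trick here is that build ends with the bare word obj, so writing $(build)=dir produces the assignment obj=dir on the sub-make command line. The tiny Python sketch below only rebuilds that string to make the expansion visible; `kbuild_command` and its default arguments are hypothetical.

```python
def kbuild_command(directory: str, srctree: str = ".", make: str = "make") -> str:
    """Expand `$(MAKE) $(build)=<directory>` into a plain command line."""
    build = f"-f {srctree}/scripts/Makefile.build obj"   # value of $(build)
    return f"{make} {build}={directory}"


print(kbuild_command("."))
print(kbuild_command("scripts/basic"))
```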
scripts/Makefile.build tries to find the Kbuild file in the directory given via the obj parameter, includes this Kbuild file:
include $(kbuild-file)
and builds targets from it. In our case . contains the Kbuild file that generates kernel/bounds.s and arch/x86/kernel/asm-offsets.s. After this the prepare target finishes its work. vmlinux-dirs also depends on the second target, scripts, which compiles the following programs: file2alias, mk_elfconfig, modpost, etc. After the scripts/host-programs compilation our vmlinux-dirs target can be executed. First of all let's try to understand what vmlinux-dirs contains. In my case it contains paths of the following kernel directories:
init usr arch/x86 kernel mm fs ipc security crypto block
drivers sound firmware arch/x86/pci arch/x86/power
arch/x86/video net lib arch/x86/lib
We can find the definition of vmlinux-dirs in the top Makefile of the Linux kernel:
vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
		     $(net-y) $(net-m) $(libs-y) $(libs-m)))
init-y		:= init/
drivers-y	:= drivers/ sound/ firmware/
net-y		:= net/
libs-y		:= lib/
...
...
...
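The vmlinux-dirs definition combines two GNU make text functions: filter keeps only the words matching the pattern %/ and patsubst then rewrites %/ to %. A rough Python analogy is shown below; the helper names are invented for the example, and the word foo.o is a hypothetical non-directory entry added only to show what filter drops.

```python
def filter_dirs(words):
    """Like $(filter %/,words): keep only words ending in a slash."""
    return [w for w in words if w.endswith("/")]


def strip_trailing_slash(words):
    """Like $(patsubst %/,%,words): drop one trailing slash per word."""
    return [w[:-1] if w.endswith("/") else w for w in words]


# Values from the excerpt above, plus one hypothetical non-directory word.
words = "init/ drivers/ sound/ firmware/ net/ lib/ foo.o".split()
vmlinux_dirs = strip_trailing_slash(filter_dirs(words))
print(vmlinux_dirs)
# ['init', 'drivers', 'sound', 'firmware', 'net', 'lib']
```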
In this definition we remove the trailing / from each directory with the help of the patsubst and filter functions and put the result into vmlinux-dirs. So we have a list of directories in vmlinux-dirs and the following code:
$(vmlinux-dirs): prepare scripts
	$(Q)$(MAKE) $(build)=$@
Here $@ is the name of the target, that is, each directory from vmlinux-dirs in turn. This means that make will go recursively over all directories from vmlinux-dirs and their internal directories (depending on the configuration) and will execute make in there. We can see it in the output:
  CC      init/main.o
  CHK     include/generated/compile.h
  CC      init/version.o
  CC      init/do_mounts.o
  ...
  CC      arch/x86/crypto/glue_helper.o
  AS      arch/x86/crypto/aes-x86_64-asm_64.o
  CC      arch/x86/crypto/aes_glue.o
  ...
  AS      arch/x86/entry/entry_64.o
  AS      arch/x86/entry/thunk_64.o
  CC      arch/x86/entry/syscall_64.o
Source code in each directory will be compiled and linked into built-in.o:
$ find . -name built-in.o
./arch/x86/crypto/built-in.o
./arch/x86/crypto/sha-mb/built-in.o
./arch/x86/net/built-in.o
./init/built-in.o
./usr/built-in.o
...
...
Ok, all the built-in.o(s) are built, so now we can go back to the vmlinux target. As you remember, the vmlinux target is in the top Makefile of the Linux kernel. Before the linking of vmlinux it builds samples, Documentation, etc., but I will not describe it here, as I wrote in the beginning of this part.
vmlinux: scripts/link-vmlinux.sh $(vmlinux-deps) FORCE
    ...
    ...
    +$(call if_changed,link-vmlinux)
As you can see, the main purpose of it is to call the scripts/link-vmlinux.sh script, which links all the built-in.o(s) into one statically linked executable and creates System.map. In the end we will see the following output:
  LINK    vmlinux
  LD      vmlinux.o
  MODPOST vmlinux.o
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  LD      init/built-in.o
  KSYM    .tmp_kallsyms1.o
  KSYM    .tmp_kallsyms2.o
  LD      vmlinux
  SORTEX  vmlinux
  SYSMAP  System.map
and vmlinux and System.map in the root of the Linux kernel source tree:
$ ls vmlinux System.map
System.map  vmlinux
That's all, vmlinux is ready. The next step is the creation of the bzImage.
Building bzImage
The bzImage file is the compressed Linux kernel image. We can get it by executing make bzImage after vmlinux is built. Alternatively, we can just execute make without any argument and we will get bzImage anyway, because it is the default image:
all: bzImage
in arch/x86/Makefile. Let's look at this target; it will help us understand how this image is built. As I already said, the bzImage target is defined in arch/x86/Makefile and looks like this:
bzImage: vmlinux
	$(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE)
	$(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot
	$(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/$@
We can see here that first of all make is called for the boot directory, which in our case is:
boot := arch/x86/boot
The main goal now is to build the source code in the arch/x86/boot and arch/x86/boot/compressed directories, build setup.bin and vmlinux.bin, and build the bzImage from them in the end. The first target in arch/x86/boot/Makefile is $(obj)/setup.elf:
$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
	$(call if_changed,ld)
We already have the setup.ld linker script in the arch/x86/boot directory and the SETUP_OBJS variable that expands to the object files built from the source files in the boot directory. We can see the first output:
  AS      arch/x86/boot/bioscall.o
  CC      arch/x86/boot/cmdline.o
  AS      arch/x86/boot/copy.o
  HOSTCC  arch/x86/boot/mkcpustr
  CPUSTR  arch/x86/boot/cpustr.h
  CC      arch/x86/boot/cpu.o
  CC      arch/x86/boot/cpuflags.o
  CC      arch/x86/boot/cpucheck.o
  CC      arch/x86/boot/early_serial_console.o
  CC      arch/x86/boot/edd.o
The next source file is arch/x86/boot/header.S, but we can't build it now because this target depends on the following two header files:
$(obj)/header.o: $(obj)/voffset.h $(obj)/zoffset.h
The first is voffset.h, generated by a sed script that gets two addresses from vmlinux with the nm util:
#define VO__end 0xffffffff82ab0000
#define VO__text 0xffffffff81000000
Theyarethestartandtheendofthekernel.Thesecondiszoffset.hdepensonthevmlinuxtargetfromthearch/x86/boot/compressed/Makefile:
$(obj)/zoffset.h:$(obj)/compressed/vmlinuxFORCE
$(callif_changed,zoffset)
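Coming back to voffset.h for a moment: the sed step boils down to filtering the nm output for the _text and _end symbols and turning each match into a #define. The following Python sketch mimics that idea. It is not the kernel's actual sed script; the function name make_voffset is made up for illustration, and the start_kernel line in the sample input is invented filler that only shows other symbols being ignored.

```python
import re

# Sample `nm` output. The _text and _end addresses are the ones quoted above;
# the start_kernel line is invented filler (assumption, not real output).
NM_OUTPUT = """\
ffffffff81000000 T _text
ffffffff81f2a000 T start_kernel
ffffffff82ab0000 B _end
"""

# Match "<hex address> <symbol type letter> <name>" for _text and _end only.
SYMBOL_RE = re.compile(r"^([0-9a-fA-F]+) [ABCDGRSTVW] (_text|_end)$")


def make_voffset(nm_output: str) -> list[str]:
    """Turn matching nm lines into '#define VO_<name> 0x<address>' lines."""
    defines = []
    for line in nm_output.splitlines():
        match = SYMBOL_RE.match(line)
        if match:
            address, name = match.groups()
            defines.append(f"#define VO_{name} 0x{address}")
    return defines


if __name__ == "__main__":
    print("\n".join(make_voffset(NM_OUTPUT)))
```

The double underscore in VO__text simply comes from gluing the VO_ prefix onto a symbol name that itself starts with an underscore.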
The $(obj)/compressed/vmlinux target depends on vmlinux-objs-y, which compiles the source code files from the arch/x86/boot/compressed directory, generates vmlinux.bin and vmlinux.bin.bz2, and compiles the mkpiggy program. We can see this in the output:
  LDS     arch/x86/boot/compressed/vmlinux.lds
  AS      arch/x86/boot/compressed/head_64.o
  CC      arch/x86/boot/compressed/misc.o
  CC      arch/x86/boot/compressed/string.o
  CC      arch/x86/boot/compressed/cmdline.o
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  BZIP2   arch/x86/boot/compressed/vmlinux.bin.bz2
  HOSTCC  arch/x86/boot/compressed/mkpiggy
Where vmlinux.bin is the vmlinux file with debugging information and comments stripped, and vmlinux.bin.bz2 is the compressed vmlinux.bin.all plus the u32 size of vmlinux.bin.all. The vmlinux.bin.all file is vmlinux.bin + vmlinux.relocs, where vmlinux.relocs is produced from vmlinux by the relocs program (see above). Once we have these files, the piggy.S assembly file is generated with the mkpiggy program and compiled:
  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o
This assembly file contains the computed offset of the compressed kernel. After this we can see that zoffset is generated:
  ZOFFSET arch/x86/boot/zoffset.h
As zoffset.h and voffset.h are now generated, compilation of the source code files from arch/x86/boot can continue:
  AS      arch/x86/boot/header.o
  CC      arch/x86/boot/main.o
  CC      arch/x86/boot/mca.o
  CC      arch/x86/boot/memory.o
  CC      arch/x86/boot/pm.o
  AS      arch/x86/boot/pmjump.o
  CC      arch/x86/boot/printf.o
  CC      arch/x86/boot/regs.o
  CC      arch/x86/boot/string.o
  CC      arch/x86/boot/tty.o
  CC      arch/x86/boot/video.o
  CC      arch/x86/boot/video-mode.o
  CC      arch/x86/boot/video-vga.o
  CC      arch/x86/boot/video-vesa.o
  CC      arch/x86/boot/video-bios.o
Once all source code files are compiled, they are linked into setup.elf:
  LD      arch/x86/boot/setup.elf
or:
ld -m elf_x86_64 -T arch/x86/boot/setup.ld arch/x86/boot/a20.o arch/x86/boot/bioscall.o arch/x86/boot/cmdline.o arch/x86/boot/copy.o arch/x86/boot/cpu.o arch/x86/boot/cpuflags.o arch/x86/boot/cpucheck.o arch/x86/boot/early_serial_console.o arch/x86/boot/edd.o arch/x86/boot/header.o arch/x86/boot/main.o arch/x86/boot/mca.o arch/x86/boot/memory.o arch/x86/boot/pm.o arch/x86/boot/pmjump.o arch/x86/boot/printf.o arch/x86/boot/regs.o arch/x86/boot/string.o arch/x86/boot/tty.o arch/x86/boot/video.o arch/x86/boot/video-mode.o arch/x86/boot/version.o arch/x86/boot/video-vga.o arch/x86/boot/video-vesa.o arch/x86/boot/video-bios.o -o arch/x86/boot/setup.elf
The last two things are the creation of setup.bin, which will contain the compiled code from the arch/x86/boot/* directory:
objcopy -O binary arch/x86/boot/setup.elf arch/x86/boot/setup.bin
and the creation of vmlinux.bin from vmlinux:
objcopy -O binary -R .note -R .comment -S arch/x86/boot/compressed/vmlinux arch/x86/boot/vmlinux.bin
In the end we compile the host program arch/x86/boot/tools/build.c, which will create our bzImage from setup.bin and vmlinux.bin:
arch/x86/boot/tools/build arch/x86/boot/setup.bin arch/x86/boot/vmlinux.bin arch/x86/boot/zoffset.h arch/x86/boot/bzImage
Actually the bzImage is the concatenation of setup.bin and vmlinux.bin. In the end we will see the output which is familiar to everyone who has built the Linux kernel from source at least once:
Setup is 16268 bytes (padded to 16384 bytes).
System is 4704 kB
CRC 94a88f9a
Kernel: arch/x86/boot/bzImage is ready  (#5)
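The padded figure in the first line of that output is consistent with the setup code being rounded up to a whole number of 512-byte sectors. Here is a minimal Python sketch of that rounding, assuming a 512-byte sector size. The helper name padded_setup_size is hypothetical and is not taken from build.c.

```python
SECTOR_SIZE = 512  # assumption: the setup code is padded to 512-byte sectors


def padded_setup_size(raw_size: int, sector_size: int = SECTOR_SIZE) -> int:
    """Round raw_size up to the next multiple of sector_size."""
    sectors = (raw_size + sector_size - 1) // sector_size
    return sectors * sector_size


if __name__ == "__main__":
    # 16268 bytes of setup code occupy 32 sectors, i.e. 16384 bytes.
    print(padded_setup_size(16268))
```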
That's all.
Conclusion
This is the end of this part. Here we saw all the steps from the execution of the make command to the generation of the bzImage. I know, the Linux kernel makefiles and the process of building the Linux kernel may seem confusing at first glance, but it is not so hard. I hope this part will help you understand the process of building the Linux kernel.
Links
GNU make util
Linux kernel top Makefile
cross-compilation
Ctags
sparse
bzImage
uname
shell
Kbuild
binutils
gcc
Documentation
System.map
Relocation
Introduction
During the writing of the linux-insides book I have received many emails with questions related to the linker script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.
If we open the Linker page on Wikipedia, we will see the following definition:
In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.
If you've written at least one program in C in your life, you will have seen files with the *.o extension. These files are object files. Object files are blocks of machine code and data with placeholder addresses that reference data and functions in other object files or libraries, as well as a list of their own functions and data. The main purpose of the linker is to collect and handle the code and data of each object file, turning them into the final executable file or library. In this post we will try to go through all aspects of this process. Let's start.
Linking process
Let's create a simple project with the following structure:
*-linkers
*--main.c
*--lib.c
*--lib.h
Our main.c source code file contains:
#include <stdio.h>
#include "lib.h"
int main(int argc, char **argv) {
	printf("factorial of 5 is: %d\n", factorial(5));
	return 0;
}
The lib.c file contains:
int factorial(int base) {
	int res = 1, i = 1;
	if (base == 0) {
		return 1;
	}
	while (i <= base) {
		res *= i;
		i++;
	}
	return res;
}
And the lib.h file contains:
#ifndef LIB_H
#define LIB_H
int factorial(int base);
#endif
Now let's compile only the main.c source code file with:
$ gcc -c main.c
If we look inside the resulting object file with the nm util, we will see the following output:
$ nm -A main.o
main.o:                 U factorial
main.o:0000000000000000 T main
main.o:                 U printf
The nm util allows us to see the list of symbols from the given object file. Its output consists of three columns: the first is the name of the given object file together with the address of any resolved symbol; the second contains a character that represents the status of the given symbol (here U means undefined and T denotes that the symbol is placed in the .text section of the object); the third is the name of the symbol. The nm utility shows us here that we have three symbols in the main.c source code file:
factorial - the factorial function defined in the lib.c source code file. It is marked as undefined here because we compiled only the main.c source code file, and it does not know anything about code from the lib.c file for now;
main - the main function;
printf - the function from the glibc library. main.c does not know anything about it for now either.
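To make the three-column layout concrete, here is a small Python sketch that splits one line of nm -A output into its parts. It assumes the simple layout shown above (file name and colon, optional hexadecimal address, one status character, symbol name). The Symbol type and the parse_nm_line helper are names invented for this illustration.

```python
from typing import NamedTuple, Optional


class Symbol(NamedTuple):
    file: str                # object file the symbol was listed for
    address: Optional[int]   # None when the symbol is undefined
    kind: str                # status character, e.g. 'U' or 'T'
    name: str


def parse_nm_line(line: str) -> Symbol:
    """Parse one line of `nm -A` output such as 'main.o:   U factorial'."""
    file, _, rest = line.partition(":")
    fields = rest.split()
    if len(fields) == 2:
        # No address column: the symbol is not resolved in this object file.
        kind, name = fields
        return Symbol(file, None, kind, name)
    address, kind, name = fields
    return Symbol(file, int(address, 16), kind, name)


if __name__ == "__main__":
    for text in ("main.o:                 U factorial",
                 "main.o:0000000000000000 T main"):
        print(parse_nm_line(text))
```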
What can we understand from the output of nm so far? The main.o object file contains the local symbol main at address 0000000000000000 (it will be filled with the correct address after it is linked), and two unresolved symbols. We can see all of this information in the disassembly output of the main.o object file:
$ objdump -S main.o
main.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 10             sub    $0x10,%rsp
   8:   89 7d fc                mov    %edi,-0x4(%rbp)
   b:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
   f:   bf 05 00 00 00          mov    $0x5,%edi
  14:   e8 00 00 00 00          callq  19 <main+0x19>
  19:   89 c6                   mov    %eax,%esi
  1b:   bf 00 00 00 00          mov    $0x0,%edi
  20:   b8 00 00 00 00          mov    $0x0,%eax
  25:   e8 00 00 00 00          callq  2a <main+0x2a>
  2a:   b8 00 00 00 00          mov    $0x0,%eax
  2f:   c9                      leaveq
  30:   c3                      retq
Here we are interested only in the two callq operations. Both of them contain linker stubs: the operand is a placeholder (all zeros), because the compiler does not yet know where factorial and printf will end up. These stubs will be updated with the real addresses of the functions by the linker. We can see the names of these functions in the following objdump output:
$ objdump -S -r main.o
...
  14:   e8 00 00 00 00          callq  19 <main+0x19>
                        15: R_X86_64_PC32       factorial-0x4
  19:   89 c6                   mov    %eax,%esi
...
  25:   e8 00 00 00 00          callq  2a <main+0x2a>
                        26: R_X86_64_PC32       printf-0x4
  2a:   b8 00 00 00 00          mov    $0x0,%eax
...
The -r or --reloc flag of the objdump util prints the relocation entries of the file. Now let's look in more detail at the relocation process.
Relocation
Relocation is the process of connecting symbolic references with symbolic definitions. Let's look at the previous snippet from the objdump output:
  14:   e8 00 00 00 00          callq  19 <main+0x19>
                        15: R_X86_64_PC32       factorial-0x4
  19:   89 c6                   mov    %eax,%esi
Note the e8 00 00 00 00 on the first line. The e8 is the opcode of the call, and the remainder of the instruction is a relative offset. So e8 00 00 00 00 contains a one-byte operation code followed by a four-byte offset. Why only 4 bytes, if an address can be 8 bytes on an x86_64 (64-bit) machine? Actually we compiled the main.c source code file with -mcmodel=small! From the gcc man page:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
Of course we didn't pass this option to gcc when we compiled main.c, but it is the default. We know from the gcc manual extract above that our program will be linked in the lower 2 GB of the address space. Four bytes is therefore enough for this. So we have the opcode of the call instruction and an unknown address. When we compile main.c with all its dependencies to an executable file and then look at the factorial call, we see:
$ gcc main.c lib.c -o factorial && objdump -S factorial | grep factorial
factorial:     file format elf64-x86-64
...
...
0000000000400506 <main>:
  40051a:       e8 18 00 00 00          callq  400537 <factorial>
...
...
0000000000400537 <factorial>:
  400550:       75 07                   jne    400559 <factorial+0x22>
  400557:       eb 1b                   jmp    400574 <factorial+0x3d>
  400559:       eb 0e                   jmp    400569 <factorial+0x32>
  40056f:       7e ea                   jle    40055b <factorial+0x24>
...
...
As we can see in the previous output, the address of the main function is 0x0000000000400506. Why does it not start from 0x0? You may already know that standard C programs are linked with the glibc C standard library (assuming -nostdlib was not passed to gcc). The compiled code for a program includes constructor functions to initialize data in the program when the program is started. These functions need to be called before the program is started, or in other words before the main function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. The initialization code of this program is placed in the special .init section, which runs before main. We can see this in the beginning of the objdump output:
$ objdump -S factorial | less
factorial:     file format elf64-x86-64
Disassembly of section .init:
00000000004003a8 <_init>:
  4003a8:       48 83 ec 08             sub    $0x8,%rsp
  4003ac:       48 8b 05 a5 05 20 00    mov    0x2005a5(%rip),%rax        # 600958 <_DYNAMIC+0x1d0>
Note that it starts at the 0x00000000004003a8 address, ahead of the code of our own program. We can also check it in the ELF output by running readelf:
$ readelf -d factorial | grep \(INIT\)
 0x000000000000000c (INIT)               0x4003a8
So, the address of the main function is 0000000000400506, and it is placed after the .init section. As we can see from the output, the address of the factorial function is 0x0000000000400537 and the binary code for the call of the factorial function is now e8 18 00 00 00. We already know that e8 is the opcode for the call instruction; the next 18 00 00 00 (note that values are stored as little endian on x86_64, so it is 00 00 00 18) is the offset from the instruction after the callq to the factorial function:
>>> hex(0x40051a + 0x18 + 0x5) == hex(0x400537)
True
So we add 0x18 and 0x5 to the address of the call instruction. The offset is measured from the address of the following instruction: our call instruction is 5 bytes long (e8 18 00 00 00), and 0x18 is the distance from the end of the call instruction to the start of the factorial function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, these address ranges will overlap, so the linker has to move them.
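The same arithmetic can be written as a tiny decoder for the 5-byte near call. This is only a sketch for the one instruction form discussed here (opcode e8 followed by a signed 32-bit little-endian offset); the function name call_target is made up for illustration.

```python
def call_target(insn_address: int, insn_bytes: bytes) -> int:
    """Return the destination of a 5-byte `call rel32` instruction."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte call starting with opcode e8")
    # The operand is a signed 32-bit offset stored in little-endian order.
    rel32 = int.from_bytes(insn_bytes[1:], "little", signed=True)
    # The offset is relative to the address of the *next* instruction.
    return insn_address + len(insn_bytes) + rel32


if __name__ == "__main__":
    target = call_target(0x40051A, bytes.fromhex("e818000000"))
    print(hex(target))
```

With the bytes from the listing above, the result is the address of factorial, 0x400537.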
What we have seen in this section is the relocation process. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.
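As a rough illustration of what the linker does with the R_X86_64_PC32 entry seen earlier, the sketch below patches the placeholder in the bytes of main. It assumes the usual definition of this relocation type, S + A - P (symbol address plus addend minus the address of the patched field), and it assumes that the code of main.o was placed at 0x400506, which is where main ended up in our executable. The function name apply_pc32 is hypothetical.

```python
def apply_pc32(code: bytearray, section_address: int, reloc_offset: int,
               symbol_address: int, addend: int) -> None:
    """Patch a 4-byte PC-relative field in place: value = S + A - P."""
    place = section_address + reloc_offset        # P: address of the field
    value = symbol_address + addend - place       # S + A - P
    code[reloc_offset:reloc_offset + 4] = value.to_bytes(4, "little", signed=True)


# Bytes of main() from main.o up to and including the first call (0x0..0x18).
MAIN_TEXT = bytearray.fromhex(
    "554889e54883ec10897dfc488975f0bf05000000" "e800000000"
)

if __name__ == "__main__":
    # Relocation entry from objdump: offset 0x15, R_X86_64_PC32, factorial-0x4.
    apply_pc32(MAIN_TEXT, section_address=0x400506, reloc_offset=0x15,
               symbol_address=0x400537, addend=-4)
    print(MAIN_TEXT[0x14:0x19].hex())
```

The -0x4 addend compensates for the fact that the patched field starts four bytes before the next instruction, which is the point the processor measures the offset from.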
Ok, now that we know a little about linkers and relocation, it is time to learn more about linkers by linking our object files.
GNU linker
As you can understand from the title, I will use the GNU linker, or just ld, in this post. Of course we can use gcc to link our factorial project:
$ gcc main.c lib.c -o factorial
and after it we will get the executable file factorial as a result:
$ ./factorial
factorial of 5 is: 120
But gcc does not link object files itself. Instead it uses collect2, which is just a wrapper for the GNU ld linker:
~$ /usr/lib/gcc/x86_64-linux-gnu/4.9/collect2 --version
collect2 version 4.9.3
/usr/bin/ld --version
GNU ld (GNU Binutils for Debian) 2.25
...
...
...
Ok, we can use gcc and it will produce the executable file of our program for us. But let's look at how to use the GNU ld linker for the same purpose. First of all let's try to link these object files with the following command:
ld main.o lib.o -o factorial
Try to do it and you will get the following error:
$ ld main.o lib.o -o factorial
ld: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'
Here we can see two problems:
The linker can't find the _start symbol;
The linker does not know anything about the printf function.
First of all, let's try to understand what this _start entry symbol is, and why it appears to be required for our program to run. When I started to learn programming I learned that the main function is the entry point of the program. I think you learned this too :) But it actually isn't the entry point; the entry point is _start. The _start symbol is defined in the crt1.o object file. We can find it with the following command:
$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_start>:
   0:   31 ed                   xor    %ebp,%ebp
   2:   49 89 d1                mov    %rdx,%r9
...
...
...
We pass this object file to the ld command as its first argument. Now let's try to link it and look at the result:
ld /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
main.o lib.o -o factorial
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o: In function `_start':
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:115: undefined reference to `__libc_csu_fini'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:116: undefined reference to `__libc_csu_init'
/tmp/buildd/glibc-2.19/csu/../sysdeps/x86_64/start.S:122: undefined reference to `__libc_start_main'
main.o: In function `main':
main.c:(.text+0x26): undefined reference to `printf'
Unfortunately we will see even more errors. We can see here the old error about the undefined printf and three more undefined references:
__libc_csu_fini
__libc_csu_init
__libc_start_main
The _start symbol is defined in the sysdeps/x86_64/start.S assembly file in the glibc source code. We can find the following assembly code lines there:
mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
...
call __libc_start_main
Here we pass the addresses of the entry points of the .init and .fini sections, which contain the code that executes when the program is run and the code that executes when the program terminates. And in the end we see the call of __libc_start_main, which in turn calls the main function of our program. The __libc_csu_init and __libc_csu_fini symbols are defined in the csu/elf-init.c source code file, and __libc_start_main comes from the C library itself. The following two object files:
crtn.o;
crti.o.
define the function prologs/epilogs for the .init and .fini sections (with the _init and _fini symbols respectively).
The crtn.o object file contains these .init and .fini sections:
$ objdump -S /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o
0000000000000000 <.init>:
   0:   48 83 c4 08             add    $0x8,%rsp
   4:   c3                      retq
Disassembly of section .fini:
0000000000000000 <.fini>:
   0:   48 83 c4 08             add    $0x8,%rsp
   4:   c3                      retq
And the crti.o object file contains the _init and _fini symbols. Let's try to link again with these two object files:
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-o factorial
And anyway we will get the same errors. Now we need to pass the -lc option to ld. This option tells the linker to search for the standard C library (libc) in its library search path, that is, in its default directories and in any directories given with -L. Let's try to link again with the -lc option:
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o -lc \
-o factorial
Finally we get an executable file, but if we try to run it, we will get strange results:
$ ./factorial
bash: ./factorial: No such file or directory
What's the problem here? Let's look at the executable file with the readelf util:
$ readelf -l factorial
Elf file type is EXEC (Executable file)
Entry point 0x4003c0
There are 7 program headers, starting at offset 64
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x0000000000000188 0x0000000000000188  R E    8
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000610 0x0000000000000610  R E    200000
  LOAD           0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x00000000000001cc 0x00000000000001cc  RW     200000
  DYNAMIC        0x0000000000000610 0x0000000000600610 0x0000000000600610
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x00000000000001e4 0x00000000004001e4 0x00000000004001e4
                 0x0000000000000020 0x0000000000000020  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10
 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame
   03     .dynamic .got .got.plt .data
   04     .dynamic
   05     .note.ABI-tag
   06
Note the strange line:
  INTERP         0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
The .interp section in the ELF file holds the path name of a program interpreter; in other words, the .interp section simply contains an ASCII string that is the name of the dynamic linker. The dynamic linker is the part of Linux that loads and links the shared libraries needed by an executable when it is executed, by copying the content of the libraries from disk to RAM. As we can see in the output of the readelf command, it is placed in the /lib64/ld-linux-x86-64.so.2 file for the x86_64 architecture. Now let's add the -dynamic-linker option with the path of ld-linux-x86-64.so.2 to the ld call and look at the results:
$ gcc -c main.c lib.c
$ ld \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o \
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o main.o lib.o \
-dynamic-linker /lib64/ld-linux-x86-64.so.2 \
-lc -o factorial
Now we can run it as a normal executable file:
$ ./factorial
factorial of 5 is: 120
It works! With the first line we compile the main.c and lib.c source code files to object files. We get main.o and lib.o after the execution of gcc:
$ file lib.o main.o
lib.o:  ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
and after this we link the object files of our program with the needed system object files and libraries. We just saw a simple example of how to compile and link a C program with the gcc compiler and the GNU ld linker. In this example we have used a couple of command line options of the GNU linker, but it supports many more command line options than -o, -dynamic-linker, etc. Moreover, GNU ld has its own language that allows us to control the linking process. In the next two paragraphs we will look into it.
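Before moving on, the INTERP program header discussed above can also be read without readelf. The following Python sketch walks the program headers of a little-endian ELF64 file and returns the interpreter string. It is a simplified illustration that assumes the standard ELF64 field offsets and does no validation beyond the magic bytes; the function name read_interp is invented for this example.

```python
import struct

PT_INTERP = 3  # program header type that points at the interpreter path


def read_interp(elf: bytes):
    """Return the interpreter path of an ELF64 little-endian image, or None."""
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    # ELF64 header fields: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)
    for index in range(e_phnum):
        base = e_phoff + index * e_phentsize
        (p_type,) = struct.unpack_from("<I", elf, base)
        if p_type == PT_INTERP:
            # Program header fields: p_offset at +0x08, p_filesz at +0x20.
            (p_offset,) = struct.unpack_from("<Q", elf, base + 0x08)
            (p_filesz,) = struct.unpack_from("<Q", elf, base + 0x20)
            raw = elf[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode("ascii")
    return None


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        print(read_interp(handle.read()))
```

Run against the factorial executable built above, it should report the same path that readelf printed in brackets. Note that the FileSiz of 0x1c (28 bytes) in the readelf output matches the 27 characters of /lib64/ld-linux-x86-64.so.2 plus the terminating zero byte.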
Useful command line options of the GNU linker
As I already wrote, and as you can see in the manual of the GNU linker, it has a big set of command line options. We've seen a couple of options in this post: -o <output>, which tells ld to produce an output file called output as the result of linking; -l<name>, which adds the archive or object file specified by the name; and -dynamic-linker, which specifies the name of the dynamic linker. Of course ld supports many more command line options, so let's look at some of them.
The first useful command line option is @file. In this case file specifies the name of a file from which command line options will be read. For example we can create a file with the name linker.ld, put our command line arguments from the previous example there and execute it with:
$ ld @linker.ld
The next command line option is -b or --format. This command line option specifies the format of the input object files: ELF, DJGPP/COFF, etc. There is a command line option for the same purpose but for the output file: --oformat=output-format.
The next command line option is --defsym. The full format of this command line option is --defsym=symbol=expression. It allows us to create a global symbol in the output file containing the absolute address given by expression. We can find the following case where this command line option can be useful: in the Linux kernel source code, and more precisely in the Makefile that is related to the kernel decompression for the ARM architecture, arch/arm/boot/compressed/Makefile, we can find the following definition:
LDFLAGS_vmlinux = --defsym _kernel_bss_size=$(KBSS_SZ)
As we already know, it defines the _kernel_bss_size symbol with the size of the .bss section in the output file. This symbol will be used in the first assembly file that is executed during kernel decompression:
ldr r5, =_kernel_bss_size
The next command line option is -shared, which allows us to create a shared library. The -M (or --print-map) command line option prints the link map with information about symbols, and -Map <filename> writes it to the given file. In our case:
...
...
...
.text           0x00000000004003c0      0x112
 *(.text.unlikely .text.*_unlikely .text.unlikely.*)
 *(.text.exit .text.exit.*)
 *(.text.startup .text.startup.*)
 *(.text.hot .text.hot.*)
 *(.text .stub .text.* .gnu.linkonce.t.*)
 .text          0x00000000004003c0       0x2a /usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
...
...
...
 .text          0x00000000004003ea       0x31 main.o
                0x00000000004003ea                main
 .text          0x000000000040041b       0x3f lib.o
                0x000000000040041b                factorial
Of course the GNU linker supports the standard command line options --help and --version, which print common help about the usage of ld and its version. That's all about command line options of the GNU linker. Of course it is not the full set of command line options supported by the ld util. You can find the complete documentation of the ld util in the manual.
Control Language linker
As I wrote previously, ld has support for its own language. It accepts Linker Command Language files written in a superset of AT&T's Link Editor Command Language syntax, to provide explicit and total control over the linking process. Let's look at its details.
With the linker language we can control:
input files;
output files;
file formats;
addresses of sections;
etc...
Commands written in the linker control language are usually placed in a file called a linker script. We can pass it to ld with the -T command line option. The main command in a linker script is the SECTIONS command. Each linker script must contain this command, and it determines the map of the output file. The special variable . contains the current position of the output. Let's write a simple assembly program and look at how we can use a linker script to control the linking of this program. We will take a hello world program for this example:
section .data
	msg	db "hello, world!",`\n`
section .text
	global	_start
_start:
	mov	rax, 1
	mov	rdi, 1
	mov	rsi, msg
	mov	rdx, 14
	syscall
	mov	rax, 60
	mov	rdi, 0
	syscall
We can compile and link it with the following commands:
$ nasm -f elf64 -o hello.o hello.asm
$ ld -o hello hello.o
Our program consists of two sections: .text contains the code of the program and .data contains initialized variables. Let's write a simple linker script and try to link our hello.asm assembly file with it. Our script is:
/*
 * Linker script for the hello program
 */
OUTPUT(hello)
OUTPUT_FORMAT("elf64-x86-64")
INPUT(hello.o)
SECTIONS
{
	. = 0x200000;
	.text : {
		*(.text)
	}
	. = 0x400000;
	.data : {
		*(.data)
	}
}
On the first three lines you can see a comment written in C style. After it, the OUTPUT and OUTPUT_FORMAT commands specify the name of our executable file and its format. The next command, INPUT, specifies the input file for the ld linker. Then we can see the main SECTIONS command, which, as I already wrote, must be present in every linker script. The SECTIONS command represents the set and order of the sections which will be in the output file.
At the beginning of the SECTIONS command we can see the line . = 0x200000. I already wrote above that the . symbol points to the current position of the output. This line says that the code should be loaded at address 0x200000, and the line . = 0x400000 says that the data section should be loaded at address 0x400000. The line after . = 0x200000 defines .text as an output section. We can see the *(.text) expression inside it. The * symbol is a wildcard that matches any file name. In other words, the *(.text) expression means all .text input sections in all input files. We could rewrite it as hello.o(.text) for our example. After the following location counter assignment, . = 0x400000, we can see the definition of the data section.
We can compile and link it with:
$ nasm -f elf64 -o hello.o hello.asm && ld -T linker.script && ./hello
hello, world!
If we look inside it with the objdump util, we can see that the .text section starts from the address 0x200000 and the .data section starts from the address 0x400000:
$ objdump -D hello
Disassembly of section .text:
0000000000200000 <_start>:
  200000:       b8 01 00 00 00          mov    $0x1,%eax
...
Disassembly of section .data:
0000000000400000 <msg>:
  400000:       68 65 6c 6c 6f          pushq  $0x6f6c6c65
...
Apart from the commands we have already seen, there are a few others. The first is ASSERT(exp, message), which ensures that the given expression is not zero. If it is zero, the linker exits with an error code and prints the given error message. If you've read about the Linux kernel booting process in the linux-insides book, you may know that the setup header of the Linux kernel has offset 0x1f1. In the linker script of the Linux kernel we can find a check for this:
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
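The fixed offset matters because boot loaders read the setup header at a known position in the image. As an illustration only, and assuming the field layout described in the x86 boot protocol documentation (a one-byte setup_sects field at 0x1f1 and the "HdrS" magic at 0x202), a Python sketch that reads these two fields could look like this. The function name setup_size is hypothetical.

```python
SETUP_SECTS_OFFSET = 0x1F1   # first field of the setup header
MAGIC_OFFSET = 0x202         # assumption: "HdrS" signature per the boot protocol
SECTOR_SIZE = 512


def setup_size(image: bytes) -> int:
    """Return the size in bytes of the real-mode part of a bzImage."""
    if image[MAGIC_OFFSET:MAGIC_OFFSET + 4] != b"HdrS":
        raise ValueError("setup header magic not found")
    setup_sects = image[SETUP_SECTS_OFFSET]
    if setup_sects == 0:
        setup_sects = 4      # the boot protocol treats 0 as 4 sectors
    # One boot sector plus setup_sects sectors of setup code.
    return (setup_sects + 1) * SECTOR_SIZE


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        print(setup_size(handle.read(0x400)))
```

If hdr ever moved away from 0x1f1, a reader like this one would look at the wrong bytes, which is exactly the situation the ASSERT in the linker script guards against.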
The INCLUDE filename command allows us to include an external linker script in the current one. In a linker script we can also assign a value to a symbol. ld supports a couple of assignment operators:
symbol = expression;
symbol += expression;
symbol -= expression;
symbol *= expression;
symbol /= expression;
symbol <<= expression;
symbol >>= expression;
symbol &= expression;
symbol |= expression;
As you can note, all of these operators are C assignment operators. For example we can use them in our linker script as:
START_ADDRESS = 0x200000;
DATA_OFFSET = 0x200000;
SECTIONS
{
	. = START_ADDRESS;
	.text : {
		*(.text)
	}
	. = START_ADDRESS + DATA_OFFSET;
	.data : {
		*(.data)
	}
}
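To see how the location counter drives the layout in a script like this, here is a toy Python model of it. It is not how ld is implemented: it only tracks a single counter, places each output section at the counter's current value and then advances the counter by the section size. The statement tuples, the function name assign_addresses and the 0x27 size used for .text are assumptions made for the example; the 14-byte size of .data is the length of the hello message.

```python
START_ADDRESS = 0x200000
DATA_OFFSET = 0x200000


def assign_addresses(statements):
    """Walk ('set', value) and ('section', name, size) statements in order."""
    dot = 0                      # the location counter, written '.' in scripts
    addresses = {}
    for statement in statements:
        if statement[0] == "set":
            dot = statement[1]   # ". = value;"
        else:
            _, name, size = statement
            addresses[name] = dot
            dot += size          # the counter moves past the placed section
    return addresses


SCRIPT = [
    ("set", START_ADDRESS),
    ("section", ".text", 0x27),          # size is a made-up placeholder
    ("set", START_ADDRESS + DATA_OFFSET),
    ("section", ".data", 14),
]

if __name__ == "__main__":
    for name, address in assign_addresses(SCRIPT).items():
        print(f"{name:6} {address:#x}")
```

The model places .text at 0x200000 and .data at 0x400000, the same addresses as the literal constants in the first version of the script.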
As you may already have noted, the syntax for expressions in the linker script language is identical to that of C expressions. Besides this, the control language of the linker supports the following builtin functions:
ABSOLUTE - returns the absolute (non-relocatable) value of the given expression;
ADDR - takes a section and returns its address;
ALIGN - returns the value of the location counter (the . operator) aligned upward to the boundary given by the expression;
DEFINED - returns 1 if the given symbol is placed in the global symbol table and 0 otherwise;
MAX and MIN - return the maximum and minimum of the two given expressions;
NEXT - returns the next unallocated address that is a multiple of the given expression;
SIZEOF - returns the size in bytes of the given named section.
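Of these, ALIGN is probably the one that most benefits from a worked example. For a power-of-two boundary, rounding the location counter up can be written as (address + boundary - 1) & ~(boundary - 1). A short Python sketch of that formula, with the helper name align chosen for illustration:

```python
def align(address: int, boundary: int) -> int:
    """Round address up to the next multiple of a power-of-two boundary."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return (address + boundary - 1) & ~(boundary - 1)


if __name__ == "__main__":
    # A counter just past 0x200000 moves to the next 4 KiB boundary.
    print(hex(align(0x200027, 0x1000)))
    # An already aligned counter stays where it is.
    print(hex(align(0x200000, 0x1000)))
```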
That's all.
Conclusion
This is the end of the post about linkers. We learned many things about linkers in this post, such as what a linker is and why it is needed, how to use it, etc.
If you have any questions or suggestions, write me an email or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.
Links
Book about Linux kernel internals
linker
object files
glibc
opcode
ELF
GNU linker
My posts about assembly programming for x86_64
readelf
Linux kernel development
Introduction
As you may already know, I started a series of blog posts about assembler programming for the x86_64 architecture last year. I had never written a line of low-level code before that moment, except for a couple of toy Hello World examples at university. That was a long time ago and, as I already said, I didn't write low-level code at all. Some time ago I became interested in such things; in other words, I understood that I could write programs, but I didn't actually understand how my program is arranged.
After writing some assembler code I began to understand, approximately, how my program looks after compilation. But still, I didn't understand many other things. For example: what occurs when the syscall instruction is executed in my assembler code, what occurs when the printf function starts to work, or how my program can talk with other computers via the network. The assembler programming language didn't give me answers to my questions, and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I'm interested in. The source code of the Linux kernel didn't give me the answers to all of my questions, but now my knowledge about the Linux kernel and the processes around it is much better.
I'm writing this part nine and a half months after I started to learn from the source code of the Linux kernel and published the first part of this book. Now it contains forty parts and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know, the Linux kernel is a very large piece of code and it is easy to forget what this or that part of the Linux kernel means and how it implements something. But soon the linux-insides repo became popular, and after nine months it has 9096 stars.
It seems that people are interested in the internals of the Linux kernel. Besides this, in all the time that I have been writing linux-insides, I have received many questions from different people, like: how to start with the Linux kernel, what do I need to start contributing to the Linux kernel, and others like these. Generally people are interested in contributing to open source projects for different reasons, and the Linux kernel is no exception.
So, it seems that people are interested in the Linux kernel development process. I thought it would be strange if a book about the Linux kernel did not contain a part that describes how to take part in Linux kernel development, and that's why I decided to write it. You will not find information about why you should be interested in contributing to the Linux kernel in this part. I see many benefits in learning the source code of the Linux kernel, but I don't know about you, so I have no answer to this question. But if you are interested in how to start with Linux kernel development, this part is for you.
Let's start.
How to start with Linux kernel
First of all, let's look at how to get, build and run the Linux kernel. Actually you can run your custom build of the Linux kernel in two ways:
Run the Linux kernel on a virtual machine;
Run the Linux kernel on real hardware.
I'll provide descriptions for both methods. Before we start to do something with the Linux kernel, we need to get it. There are a couple of ways to do it, depending on your purpose. If you just want to update the current version of the Linux kernel on your computer, you can use the instructions specific to your Linux distro.
In the first case you just need to download the new version of the Linux kernel with the package manager. For example, to upgrade the version of the Linux kernel to 4.1 on Ubuntu (Vivid Vervet), you will just need to execute the following commands:
$ sudo add-apt-repository ppa:kernel-ppa/ppa
$ sudo apt-get update
After this execute this command:
$ apt-cache showpkg linux-headers
and choose the version of the Linux kernel in which you are interested. In the end execute the next command, replacing ${version} with the version that you chose in the output of the previous command:
$ sudo apt-get install linux-headers-${version} linux-headers-${version}-generic linux-image-${version}-generic --fix-missing
and reboot your system. After the reboot you will see the new kernel in the grub menu.
On the other hand, if you are interested in Linux kernel development, you will need to get the source code of the Linux kernel. You can find it on the kernel.org website and download an archive with the Linux kernel source code. Actually the Linux kernel development process is fully built around the git version control system. So you can get it with git from kernel.org:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
I don't know about you, but I prefer github. There is a mirror of the Linux kernel mainline repository there, so you can clone it with:
$ git clone git@github.com:torvalds/linux.git
Actually I'm using my fork for development, and when I want to pull updates from the main repository I just execute the following commands:
$ git checkout master
$ git pull upstream master
Note that the remote name of the main repository is upstream. To add a new remote for the main linux repository you can execute:
$ git remote add upstream git@github.com:torvalds/linux.git
After this you will have two remotes:
~/dev/linux (master) $ git remote -v
origin  git@github.com:0xAX/linux.git (fetch)
origin  git@github.com:0xAX/linux.git (push)
upstream        https://github.com/torvalds/linux.git (fetch)
upstream        https://github.com/torvalds/linux.git (push)
One is for your fork (origin) and the second is for the main repository (upstream).
Now that we have a local copy of the Linux kernel source code, we need to configure and build it. The Linux kernel can be configured in different ways. The simplest way is to just copy the configuration file of the already installed kernel that is located in the /boot directory:
$ sudo cp /boot/config-$(uname -r) ~/dev/linux/.config
If your current Linux kernel was built with support for access to the /proc/config.gz file, you can copy your actual kernel configuration file with this command:
$ cat /proc/config.gz | gunzip > ~/dev/linux/.config
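If you would rather inspect that configuration than just copy it, the gzip data can be decoded in a few lines. The Python sketch below parses the KEY=value lines of a compressed kernel configuration into a dictionary and skips comments such as "# CONFIG_FOO is not set". The function name parse_kernel_config and the sample option names are made up for illustration; on a real system the input would be the bytes of /proc/config.gz.

```python
import gzip


def parse_kernel_config(raw_gz: bytes) -> dict[str, str]:
    """Decode gzip-compressed kernel config text into {option: value}."""
    options = {}
    for line in gzip.decompress(raw_gz).decode().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue             # blank line or comment / disabled option
        key, _, value = line.partition("=")
        options[key] = value
    return options


if __name__ == "__main__":
    # Small in-memory sample so the sketch runs without /proc/config.gz.
    sample = gzip.compress(
        b"# Automatically generated file\n"
        b"CONFIG_64BIT=y\n"
        b"# CONFIG_FOO is not set\n"
        b"CONFIG_HZ=250\n"
    )
    print(parse_kernel_config(sample))
```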
Ifyouarenotsatisfiedwiththestandardkernelconfigurationthatisprovidedbythemaintainersofyourdistro,youcanconfiguretheLinuxkernelmanually.Thereareacoupleofwaystodoit.TheLinuxkernelrootMakefileprovidesasetoftargetsthatallowsyoutoconfigureit.Forexamplemenuconfigprovidesamenu-driveninterfaceforthekernelconfiguration:
LinuxInside
394Linuxkerneldevelopment
Thedefconfigargumentgeneratesthedefaultkernelconfigurationfileforthecurrentarchitecture,forexamplex86_64defconfig.YoucanpasstheARCHcommandlineargumenttomaketobuilddefconfigforthegivenarchitecture:
$makeARCH=arm64defconfig
Theallnoconfig,allyesconfigandallmodconfigargumentsallowyoutogenerateanewconfigurationfilewherealloptionswillbedisabled,enabledandenabledasmodulesrespectively.ThenconfigcommandlineargumentsthatprovidesncursesbasedprogramwithmenutoconfigureLinuxkernel:
LinuxInside
395Linuxkerneldevelopment
AndevenrandconfigtogeneraterandomLinuxkernelconfigurationfile.IwillnotwritehowtoconfiguretheLinuxkernel,whichoptionstoenableandwhatnot,becauseitmakesnosensetodosofortworeasons:FirstofallIdonotknowyourhardwareandsecond,ifyouknowyourhardware,theonlyremainingtaskistofindouthowtouseprogramsforkernelconfiguration,andallofthemareprettysimpletouse.
OK, at this point we have the source code of the Linux kernel and have configured it. The next step is the compilation of the Linux kernel. The simplest way to compile the Linux kernel is to just execute:
$ make
scripts/kconfig/conf --silentoldconfig Kconfig
#
# configuration written to .config
#
  CHK     include/config/kernel.release
  UPD     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  ...
  ...
  ...
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 15740 bytes (padded to 15872 bytes).
System is 4342 kB
CRC 82703414
Kernel: arch/x86/boot/bzImage is ready  (#73)
To increase the speed of kernel compilation you can pass the -jN command line argument to the make utility, where N specifies the number of commands to run simultaneously:
$ make -j8
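A common choice for N is the number of CPUs in the machine. The sketch below only prints the resulting command; the variable name njobs is arbitrary, and the fallback to 1 is an assumption for systems where neither nproc nor getconf reports a CPU count.

```shell
#!/bin/sh
# Derive a job count from the number of online CPUs; fall back to 1.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "make -j${njobs}"
```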
If you want to build the Linux kernel for an architecture that differs from your current one, the simplest way to do it is to pass two arguments:
the ARCH command line argument with the name of the target architecture;
the CROSS_COMPILE command line argument with the cross-compiler tool prefix.
For example, if we want to compile the Linux kernel for arm64 with the default kernel configuration file, we need to execute the following commands:
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
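The prefix is simply prepended to the tool names, so aarch64-linux-gnu- means that aarch64-linux-gnu-gcc and friends must be on your PATH. Here is a small sketch that checks this before building; cross_make_cmd is a made-up helper name, and it only prints the make command line rather than running it.

```shell
#!/bin/sh
# Print the make invocation for a cross build, after checking that the
# compiler named by the prefix is on PATH. Nothing is built here.
# cross_make_cmd is a hypothetical helper used only for this sketch.
cross_make_cmd() {
    arch=$1
    prefix=$2
    shift 2
    if ! command -v "${prefix}gcc" >/dev/null 2>&1; then
        echo "warning: ${prefix}gcc not found on PATH" >&2
    fi
    echo "make ARCH=${arch} CROSS_COMPILE=${prefix} $*"
}

cross_make_cmd arm64 aarch64-linux-gnu- defconfig
cross_make_cmd arm64 aarch64-linux-gnu- -j4
```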
For an x86_64 build, the result of the compilation is the compressed kernel image, arch/x86/boot/bzImage. Now that we have a compiled kernel, we can either install it on our computer or just run it in an emulator.
Installing Linux kernel
As I already wrote, we will consider two ways to launch the new kernel: in the first case we install and run the new version of the Linux kernel on real hardware, and in the second we launch the Linux kernel in a virtual machine. In the previous paragraph we saw how to build the Linux kernel from source code, and as a result we got the compressed image:
...
...
...
Kernel: arch/x86/boot/bzImage is ready  (#73)
After we have got the bzImage, we need to install the headers and modules of the new Linux kernel with:
$ sudo make headers_install
$ sudo make modules_install
and then the kernel itself:
$ sudo make install
From this moment we have installed the new version of the Linux kernel, and now we must tell the bootloader about it. Of course we can add it manually by editing the /boot/grub2/grub.cfg configuration file, but I prefer to use a script for this purpose. I'm using two different Linux distros, Fedora and Ubuntu, and they update the grub configuration file in two different ways. I'm using the following script for this purpose:
#!/bin/bash

# term-colors defines the Green and Color_Off variables used below
source "term-colors"

DISTRIBUTIVE=$(cat /etc/*-release | grep NAME | head -1 | sed -n -e 's/NAME\=//p')
echo -e "Distributive: ${Green}${DISTRIBUTIVE}${Color_Off}"

if [[ "$DISTRIBUTIVE" == "Fedora" ]] ;
then
    su -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
else
    sudo update-grub
fi

echo -e "${Green}Done.${Color_Off}"
This is the last step of the new Linux kernel installation. After this you can reboot your computer and select the new version of the kernel during boot.
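If you do not have a term-colors file at hand, the same decision can be sketched without it. This variant is an assumption of mine and not the script I actually use: it reads the ID field from /etc/os-release (fedora and ubuntu on the two distros above) and only prints the command instead of running it. The name grub_update_cmd is made up.

```shell
#!/bin/sh
# Map a distribution ID (the ID= field of /etc/os-release) to the command
# that regenerates the grub configuration. Only the two distros mentioned
# in the text are handled; anything else is reported as unknown.
grub_update_cmd() {
    case "$1" in
        fedora) echo "grub2-mkconfig -o /boot/grub2/grub.cfg" ;;
        ubuntu) echo "update-grub" ;;
        *)      return 1 ;;
    esac
}

# Read ID in a subshell so the os-release variables do not leak.
distro=$( [ -r /etc/os-release ] && . /etc/os-release && echo "$ID" ) || distro=""
if cmd=$(grub_update_cmd "$distro"); then
    echo "would run as root: $cmd"
else
    echo "unknown distribution '${distro}', update the grub config manually"
fi
```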
The second case is to launch the new Linux kernel in a virtual machine. I prefer qemu. First of all we need to build an initial ramdisk, initrd, for this. The initrd is a temporary root file system that is used by the Linux kernel during the initialization process while other filesystems are not mounted. We can build an initrd in the following steps.
First of all we need to download busybox and run menuconfig for its configuration:
$ mkdir initrd
$ cd initrd
$ curl http://busybox.net/downloads/busybox-1.23.2.tar.bz2 | tar xjf -
$ cd busybox-1.23.2/
$ make menuconfig
$ make -j4
busybox is an executable file, /bin/busybox, that contains a set of standard tools like coreutils. In the busybox menu we need to enable the "Build BusyBox as a static binary (no shared libs)" option.
We can find this menu in:
Busybox Settings
--> Build Options
After this we exit from the busybox configuration menu and execute the following commands to build and install it:
$ make -j4
$ sudo make install
OK, busybox is now installed and we can start to build our initrd. To do this, we go back to the initrd directory and run:
$ cd ..
$ mkdir -p initramfs
$ cd initramfs
$ mkdir -pv {bin,sbin,etc,proc,sys,usr/{bin,sbin}}
$ cp -av ../busybox-1.23.2/_install/* .
The last command copies the busybox files to the bin, sbin and other directories. Now we need to create an executable init file that will be executed as the first process in the system. My init file just mounts the procfs and sysfs filesystems and executes a shell:
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
exec /bin/sh
Now we can create an archive that will be our initrd:
$ find . -print0 | cpio --null -ov --format=newc | gzip -9 > ~/dev/initrd_x86_64.gz
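One detail that is easy to miss is that the init file has to be executable before it goes into the archive. A small sketch of that step follows; it writes the same init script into a throwaway temporary directory (the variable name stage is arbitrary) rather than into your real initramfs directory.

```shell
#!/bin/sh
# Write the init script from the text into a staging directory and mark it
# executable. The staging path is a throwaway temp dir for illustration.
stage=$(mktemp -d)
cat > "$stage/init" <<'EOF'
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
exec /bin/sh
EOF
chmod +x "$stage/init"

# The kernel runs /init from the unpacked archive, so it must be executable.
if [ -x "$stage/init" ]; then
    echo "init is executable"
fi
```

After packing, you can list what went into the archive with zcat ~/dev/initrd_x86_64.gz | cpio -t and check that init is there.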
We can now run our kernel in the virtual machine. As I already wrote, I prefer qemu for this. We can run our kernel with the following command:
$ qemu-system-x86_64 -snapshot -m 8GB -serial stdio -kernel ~/dev/linux/arch/x86_64/boot/bzImage -initrd ~/dev/initrd_x86_64.gz -append "root=/dev/sda1 ignore_loglevel"
From now on we can run the Linux kernel in the virtual machine, and this means that we can begin to change and test the kernel.
Consider using ivandavidov/minimal to automate the process of generating the initrd.
Getting started with the Linux Kernel Development
The main point of this paragraph is to answer two questions: what to do and what not to do before you send your first patch to the Linux kernel. Please do not confuse this "to do" with "todo". I have no answer for what you can fix in the Linux kernel. I just want to describe my workflow when experimenting with the Linux kernel source code.
First of all, I pull the latest updates from Linus's repo with the following commands:
$ git checkout master
$ git pull upstream master
After this, my local repository with the Linux kernel source code is synced with the mainline repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you on where to start and what to do in the Linux kernel. But the best place for newbies is the staging tree, in other words the set of drivers from drivers/staging. The maintainer of the staging tree is Greg Kroah-Hartman, and the staging tree is the place where your trivial patch can be accepted. Let's look at a simple example that describes how to generate a patch, check it and send it to the Linux kernel mailing list.
If we look at the driver for the Digi International EPCA PCI based devices, we will see the dgap_sindex function:
static char *dgap_sindex(char *string, char *group)
{
    char *ptr;

    if (!string || !group)
        return NULL;

    for (; *string; string++) {
        for (ptr = group; *ptr; ptr++) {
            if (*ptr == *string)
                return string;
        }
    }

    return NULL;
}
This function is defined on line 295. It looks for a match of any character in the group and returns that position. While studying the Linux kernel source code, I noticed that the lib/string.c source code file contains an implementation of the strpbrk function that does the same thing as dgap_sindex. It is not a good idea to use a custom implementation of a function that already exists. So we can remove the dgap_sindex function from the drivers/staging/dgap/dgap.c source code file and use strpbrk instead.
First of all, let's create a new git branch based on the current master, which is synced with the Linux kernel mainline repo:
$ git checkout -b "dgap-remove-dgap_sindex"
And now we can replace dgap_sindex with strpbrk. After we have made all the changes, we need to recompile the Linux kernel or just the dgap directory. Do not forget to enable this driver in the kernel configuration. You can find it in:
Device Drivers
--> Staging drivers
----> Digi EPCA PCI products
Now it is time to make a commit. I'm using the following combination for this:
$ git add .
$ git commit -s -v
After the last command an editor will be opened, chosen from the $GIT_EDITOR or $EDITOR environment variable. The -s command line argument adds a Signed-off-by line by the committer at the end of the commit log message. You can find this line at the end of each commit message, for example 00cc1633. The main point of this line is to track who made a change. The -v option shows a unified diff between the HEAD commit and what would be committed at the bottom of the commit message. It is not necessary, but it is sometimes very useful. A couple of words about the commit message. A commit message consists of two parts:
The first part is on the first line and contains a short description of the changes. In the patch email it starts with the [PATCH] prefix, which git format-patch adds for you, followed by a subsystem, driver or architecture name and, after the : symbol, a short description. In our case it will be something like this:
[PATCH] staging/dgap: Use strpbrk() instead of dgap_sindex()
After the short description we usually have an empty line and the full description of the commit. In our case it will be:
The <linux/string.h> provides strpbrk() function that does the same that the
dgap_sindex(). Let's use already defined function instead of writing custom.
And the Signed-off-by line at the end of the commit message. Note that each line of a commit message must be no longer than 80 symbols and the commit message must describe your changes in detail. Do not just write a commit message like "Custom function removed"; you need to describe what you did and why. The patch reviewers must know what they review. Besides this, commit messages written this way are very helpful: each time we can't understand something, we can use git blame to read the description of the changes.
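The line length rule is easy to check mechanically. Below is a minimal sketch of such a check; the function name long_lines is made up, and the limit of 80 is the one mentioned above. It is run here on the example commit message written to a temporary file.

```shell
#!/bin/sh
# Print every line of a commit message that is longer than MAX characters,
# prefixed with its line number. Reads the message on stdin.
# long_lines is a hypothetical helper used only for this sketch.
long_lines() {
    awk -v max="${1:-80}" 'length($0) > max { printf "%d: %s\n", NR, $0 }'
}

# The example commit message from the text, written to a temp file.
msgfile=$(mktemp)
cat > "$msgfile" <<'EOF'
staging/dgap: Use strpbrk() instead of dgap_sindex()

The <linux/string.h> provides strpbrk() function that does the same that the
dgap_sindex(). Let's use already defined function instead of writing custom.
EOF

report=$(long_lines 80 < "$msgfile")
if [ -z "$report" ]; then
    echo "all lines fit in 80 columns"
else
    echo "$report"
fi
rm -f "$msgfile"
```

On a real commit you could feed it the last message with git log -1 --format=%B | long_lines 80.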
After we have committed the changes, it is time to generate a patch. We can do it with the format-patch command:
$ git format-patch master
0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
We've passed the name of a branch (master in this case) to the format-patch command, which generates a patch with the latest changes that are in the dgap-remove-dgap_sindex branch and not in the master branch. As you can see, the format-patch command generates a file that contains the latest changes and has a name based on the commit's short description. If you want to generate a patch with a custom name, you can use the --stdout option:
$ git format-patch master --stdout > dgap-patch-1.patch
The last step after we have generated our patch is to send it to the Linux kernel mailing list. Of course you can use any email client, but git provides a special command for this: git send-email. Before you send your patch, you need to know where to send it. Yes, you can send it just to the Linux kernel mailing list address, which is linux-kernel@vger.kernel.org, but there is a high probability that the patch will be ignored because, as you may already know, there is a large flow of messages on the Linux kernel mailing list. A better way is to send it to the maintainer of the subsystem where you have made changes. We can find the maintainer and other people who have touched the code with the get_maintainer.pl script. All you need to do is pass the file or directory where you wrote the code. Go to the root directory of the Linux kernel source code and execute it:
$ ./scripts/get_maintainer.pl -f drivers/staging/dgap/dgap.c
Lidza Louina <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
Mark Hounschell <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
Daeseok Youn <[email protected]> (maintainer:DIGI EPCA PCI PRODUCTS)
Greg Kroah-Hartman <[email protected]> (supporter:STAGING SUBSYSTEM)
[email protected] (open list:DIGI EPCA PCI PRODUCTS)
[email protected] (open list:STAGING SUBSYSTEM)
[email protected] (open list)
You will see a set of names and related emails. Now we can send our patch with:
$ git send-email --to "Lidza Louina <[email protected]>" \
    --cc "Mark Hounschell <[email protected]>" \
    --cc "Daeseok Youn <[email protected]>" \
    --cc "Greg Kroah-Hartman <[email protected]>" \
    --cc "[email protected]" \
    --cc "[email protected]" \
    --cc "[email protected]" \
    0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
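Typing all of these addresses by hand is error-prone, so the output of get_maintainer.pl can be turned into arguments with a little text processing. This is only a sketch under my own assumptions: cc_args is a made-up function name, and the sample input uses invented names with example.org addresses instead of real maintainers.

```shell
#!/bin/sh
# Strip the trailing "(role)" annotation from each get_maintainer.pl line
# and emit one --cc argument per address. Reads the script output on stdin.
# cc_args is a hypothetical helper used only for this sketch.
cc_args() {
    sed -e 's/ *([^()]*)$//' -e 's/^/--cc=/'
}

# Sample input with made-up names and example.org addresses.
cc_args <<'EOF'
Jane Maintainer <jane@example.org> (maintainer:SOME DRIVER)
some-list@example.org (open list:SOME DRIVER)
EOF
```

As far as I know, git send-email also has --to-cmd and --cc-cmd options that run a command to produce the addresses, and get_maintainer.pl has a --norolestats option to drop the role annotations; check the documentation of both before relying on them.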
That's all. The patch is sent and now you only have to wait for feedback from the Linux kernel developers. After you have sent a patch and a maintainer has accepted it, you will find it in the maintainer's repository (for example, the patch that you saw in this part). After some time the maintainer will send a pull request to Linus and you will see your patch in the mainline repository.
That'sall.
Some advice
At the end of this part I want to give you some advice about what to do and what not to do during development of the Linux kernel:
Think, think, think. And think again before you decide to send a patch.
Each time you change something in the Linux kernel source code, compile it. After every change. Again and again. Nobody likes changes that don't even compile.
The Linux kernel has a coding style guide and you need to comply with it. There is a great script that can help to check your changes: scripts/checkpatch.pl. Just pass the source code file with your changes to it and you will see:
$ ./scripts/checkpatch.pl -f drivers/staging/dgap/dgap.c
WARNING: Block comments use * on subsequent lines
#94: FILE: drivers/staging/dgap/dgap.c:94:
+/*
+     SUPPORTED PRODUCTS
CHECK: spaces preferred around that '|' (ctx:VxV)
#143: FILE: drivers/staging/dgap/dgap.c:143:
+    { PPCM,        PCI_DEV_XEM_NAME,     64, (T_PCXM|T_PCLITE|T_PCIBUS) },
You can also see problematic places with the help of git diff.
Linus doesn't accept GitHub pull requests.
If your change consists of several different and unrelated changes, you need to split them into separate commits. The git format-patch command will generate a patch for each commit, and the subject of each patch will contain a [PATCH n/m] prefix, where n is the number of the patch and m is the total number of patches in the series. If you are planning to send a series of patches, it will be helpful to pass the --cover-letter option to the git format-patch command. This will generate an additional file that will contain the cover letter, which you can use to describe what your patchset changes. It is also a good idea to use the --in-reply-to option of the git send-email command. This option allows you to send your patch series in reply to your cover message. The structure of your patch series will look like this for a maintainer:
|--> cover letter
|----> patch_1
|----> patch_2
You need to pass the message-id, which you can find in the output of git send-email, as the argument of the --in-reply-to option.
It's important that your email be in plain text format. Generally, send-email and format-patch are very useful during development, so look at the documentation for these commands (git send-email and git format-patch) and you'll find some useful options.
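To see what a series with a cover letter looks like without touching your kernel tree, you can try it in a scratch repository. The sketch below is purely illustrative: the repository, the author identity, the file and the commit subjects are all invented for the demonstration.

```shell
#!/bin/sh
# Build a throwaway repository with two commits on a topic branch and
# generate a patch series with a cover letter from it.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the branch is "master"
git config user.name "Example Author"
git config user.email "author@example.org"

echo base > file.txt
git add file.txt
git commit -q -m "base"

git checkout -q -b topic
echo one >> file.txt
git commit -q -a -m "demo: first change"
echo two >> file.txt
git commit -q -a -m "demo: second change"

# One file per commit plus 0000-cover-letter.patch, all under outgoing/
git format-patch -q --cover-letter -o outgoing master
ls outgoing
grep '^Subject:' outgoing/0001-*.patch
```

Each generated patch carries its position in the series in the subject line, and the cover letter file contains placeholders for the subject and the description that you fill in before sending.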
Do not be surprised if you do not get an immediate answer after you send your patch. Maintainers can be very busy.
The scripts directory contains many useful scripts that are related to Linux kernel development. We already saw two scripts from this directory: the checkpatch.pl and get_maintainer.pl scripts. Besides those scripts, you can find the stackusage script that will print stack usage, extract-vmlinux for extracting an uncompressed kernel image, and many others. Outside of the scripts directory you can find some very useful scripts by Lorenzo Stoakes for kernel development.
Subscribe to the Linux kernel mailing list. There are a large number of letters every day on lkml, but it is very useful to read them and understand things such as the current state of the Linux kernel. Other than lkml there is a set of mailing lists that are related to the different Linux kernel subsystems.
If your patch is not accepted the first time and you receive feedback from Linux kernel developers, make your changes and resend the patch with the [PATCH vN] prefix (where N is the number of the patch version). For example:
[PATCH v2] staging/dgap: Use strpbrk() instead of dgap_sindex()
It must also contain a changelog that describes all changes from the previous patch versions. Of course, this is not an exhaustive list of requirements for Linux kernel development, but some of the most important items were addressed.
Happy Hacking!
Conclusion
I hope this will help others join the Linux kernel community! If you have any questions or suggestions, write me an email or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please let me know via email or send a PR.
Links
blog posts about assembly programming for x86_64
Assembler
distro
package manager
grub
kernel.org
version control system
arm64
bzImage
qemu
initrd
busybox
coreutils
procfs
sysfs
Linux kernel mail listing archive
Linux kernel coding style guide
How to Get Your Change Into the Linux Kernel
Linux Kernel Newbies
plain text
Useful links

Linux boot
Linux/x86 boot protocol
Linux kernel parameters

Protected mode
64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Serial programming
8250 UART Programming
Serial ports on OSDEV

VGA
Video Graphics Array (VGA)

IO
IO port programming

GCC and GAS
GCC type attributes
Assembler Directives

Important data structures
task_struct definition

Other architectures
PowerPC and Linux Kernel Inside

Useful links
Linux x86 Program Start Up
Memory Layout in Program Execution (32 bits)
Thank you to all contributors:
Akash Shende
Jakub Kramarz
ckrooss
ecksun
Maciek Makowski
Thomas Marcelis
Chris Costes
nathansoz
RubanDeventhiran
fuzhli
andars
Alexandru Pana
Bogdan Rădulescu
zil
codelitt
gulyasm
alx741
Haddayn
Daniel Campoverde Carrión
Guillaume Gomez
Leandro Moreira
Jonatan Pålsson
George Horrell
Ciro Santilli
Kevin Soules
Fabio Pozzi
Kevin Swinton
Leandro Moreira
LYF610400210
Cam Cope
Miquel Sabaté Solà
Michael Aquilina
Gabriel Sullice
Michael Drüing
Alexander Polakov
Anton Davydov
Arpan Kapoor
Brandon Fosdick
Ashleigh Newman-Jones
Terrell Russell
Mario
Ewoud Kohl van Wijngaarden
Jochen Maes
Brother-Lal
Brian McKenna
Josh Triplett
James Flowers
Alexander Harding
Dzmitry Plashchynski
Simarpreet Singh
umatomba
Vaibhav Tulsyan
Brandon Wamboldt
Maxime Leboeuf
Maximilien Richer
marmeladema
Anisse Astier
TheCodeArtist
Ehsun N
Adam Shannon
Donny Nadolny
Ehsun N
Waqar Ahmed
Ian Miell
DongLiang Mu
Johan Manuel
Brian Rak
Robin Peiremans
xiaoqiang zhao
aouelete
Dennis Birkholz
Anton Tyurin
Bogdan Kulbida