Erik Bernhardsson
The half-life of code & the ship of Theseus
2016-12-05
As a project evolves, does the new code just add on top of the old code? Or does
it replace the old code slowly over time? In order to understand this, I built a
little thing to analyze Git projects, with help from the formidable GitPython
project. The idea is to go back to historical points in time and run a git blame
at each of them (making this somewhat fast was a bit nontrivial, as it turns out,
but I’ll spare you the details, which involve some opportunistic caching of files,
picking historical points spread out in time, using git diff to invalidate changed
files, etc).
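The core of that idea can be sketched roughly as follows. This is a minimal sketch, not the actual Git of Theseus code: it skips the caching and git diff tricks entirely, and the helper names (`lines_by_year`, `blame_snapshot`) and the `suffix` parameter are made up for illustration. It assumes GitPython's `Repo.blame(rev, path)` returns (commit, lines) pairs and that each commit exposes `committed_datetime`.

```python
from collections import Counter


def lines_by_year(blame_entries):
    """Count surviving lines per authoring year.

    `blame_entries` is an iterable of (commit, lines) pairs, where
    `commit.committed_datetime` is a datetime and `lines` is a list.
    """
    counts = Counter()
    for commit, lines in blame_entries:
        counts[commit.committed_datetime.year] += len(lines)
    return counts


def blame_snapshot(repo_path, rev, suffix=".py"):
    """Blame every matching file at `rev` and aggregate by year.

    Hypothetical helper: no caching, so it is slow on large repos.
    """
    import git  # GitPython; imported lazily so lines_by_year works without it

    repo = git.Repo(repo_path)
    totals = Counter()
    for item in repo.commit(rev).tree.traverse():
        # traverse() yields trees as well as blobs; only blame files.
        if item.type == "blob" and item.path.endswith(suffix):
            totals += lines_by_year(repo.blame(rev, item.path))
    return totals
```

Calling `blame_snapshot` at a series of historical revisions would give one year-by-year breakdown per point in time.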
In a moment of clarity, I named it “Git of Theseus” as a terrible pun on the ship of
Theseus. I’m a dad now, so I can make terrible puns. It refers to a philosophical
paradox, where the pieces of a ship are replaced for hundreds of years. If all
pieces are replaced, is it still the same ship?
The ship wherein Theseus and the youth of Athens returned from Crete had
thirty oars, and was preserved by the Athenians down even to the time of
Demetrius Phalereus, for they took away the old planks as they decayed,
putting in new and stronger timber in their places, insomuch that this ship
became a standing example among the philosophers, for the logical question
of things that grow; one side holding that the ship remained the same, and the
other contending that it was not the same.
It turns out that code doesn’t exactly evolve the way I expected. There is a “ship
of Theseus” effect, but there’s also a compounding effect where code bases keep
growing over time (maybe I should call it the “Second Avenue Subway” effect, after
the construction project in NYC that’s been going on since 1919).
Let’s start by analyzing Git itself. Git became self-hosting early on, and it’s one
of the most popular and oldest Git projects:
This plots the aggregate number of lines of code over time, broken down into
cohorts by the year added. I would have expected more of a decay here, and I’m
surprised to see that so much code written back in 2006 is still alive in the code
base. Interesting!
We can compute the decay for individual commits too. If we align all commits
at x=0, we can look at the aggregate decay for code in a certain repo. This
analysis is somewhat harder to implement than it sounds, for various reasons
(mostly because newer commits have had less time, so the right
end of the curve represents an aggregate of fewer commits).
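One way to deal with the unequal observation windows is to divide, at each age, only by the lines from commits old enough to have been observed at that age. A minimal sketch, assuming a made-up input format where each commit is represented by a list of surviving line counts at ages 0, 1, 2, ... (in whatever time step you sample at); this is not necessarily how the plots here were produced:

```python
def aggregate_survival(histories):
    """Fraction of lines still alive at each age, pooled across commits.

    `histories` is a list of lists; histories[i][k] is the number of lines
    from commit i still present k time steps after it was made. Newer
    commits simply have shorter lists.
    """
    max_age = max(len(h) for h in histories)
    curve = []
    for age in range(max_age):
        # Only commits observed for at least `age` steps contribute here.
        observed = [h for h in histories if len(h) > age]
        alive = sum(h[age] for h in observed)
        initial = sum(h[0] for h in observed)
        curve.append(alive / initial if initial else 0.0)
    return curve
```

With `[[100, 80, 60], [50, 25]]` as input, the last point of the curve is based on the first commit alone, which is exactly why the right end of such a curve is noisier.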
For Git, this plot looks like this:
Even after 10 years, 40% of the lines of code are still present! Let’s look at a broader
range of (somewhat randomly selected) open source projects:
It looks like Git is somewhat of an outlier here. Fitting an exponential decay to
Git and solving for the half-life gives approximately 6 years.
Hmm… not convinced this is necessarily a perfect fit, but as the famous quote
goes: All models are wrong, some models are useful. I like the explanatory
power of an exponential decay: code has an expected lifetime and a constant
risk of being replaced.
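In this model the fraction of lines surviving after t years is exp(-λt), and the half-life is ln(2)/λ. A minimal sketch of such a fit, assuming a plain least-squares fit on the log of the surviving fraction (the function name is made up, and this is not necessarily how the fits in this post were done):

```python
import math


def fit_half_life(ages, fractions):
    """Fit fraction = exp(-lam * age) and return the half-life ln(2) / lam.

    Uses least squares through the origin on log(fraction), so every
    fraction must be strictly positive.
    """
    num = sum(t * math.log(f) for t, f in zip(ages, fractions))
    den = sum(t * t for t in ages)
    lam = -num / den
    return math.log(2) / lam
```

For instance, a curve that passes through 50% at 6 years and 25% at 12 years is a perfect exponential, and `fit_half_life([6, 12], [0.5, 0.25])` recovers a half-life of 6 years (up to floating point error).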
I suspect a slightly better model would be to fit a sum of exponentials. This
would work for a repo with some code that changes fast and some code that
changes slowly. But before going down a rabbit hole of curve fitting, I reminded
myself of von Neumann’s quote: With four parameters I can fit an elephant,
and with five I can make him wiggle his trunk. There’s probably some way to
make it work, but I’ll revisit it some other time.
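For reference, a two-component version of that model could look like the sketch below: a share `w` of the code decays at rate `fast` and the rest at rate `slow`. The parameter names and the bisection search are just for illustration, and no fitting is attempted here.

```python
import math


def mixture_survival(t, w, fast, slow):
    """Fraction surviving at age t for a two-component exponential mix."""
    return w * math.exp(-fast * t) + (1 - w) * math.exp(-slow * t)


def mixture_half_life(w, fast, slow, hi=1000.0, tol=1e-9):
    """Find t where mixture_survival(t) == 0.5 by bisection.

    Assumes 0 <= w <= 1 and positive rates, so the curve falls
    monotonically from 1 toward 0, and that the half-life is below `hi`.
    """
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_survival(mid, w, fast, slow) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Unlike the single exponential, the mix has no closed-form half-life, which is why the sketch solves for it numerically.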
Let’s look at a lot of projects in aggregate (also sampled somewhat arbitrarily):
In aggregate, the half-life is roughly 3.33 years. I like that, it’s an easy number
to remember. But the spread is big between different projects. The aggregate
model doesn’t necessarily have super strong predictive power: it’s hard to
point to an arbitrary open source project and expect half of it to be gone 3.33
years later.
Moar repos
Apache (aka HTTPD) is another repo that goes way back:
Rails:
Beautiful exponential fit!
Node:
Wanna run it for your own repo? Again, code is available here.
The monster repo of them all
Note that most of these repos took at most a few minutes to analyze, using my
script. As a final test I decided to run it over the Linux kernel which is HUGE:
635,229 commits as of today. This is 16 times larger than the second biggest
repo I looked at (rails) and took multiple days to analyze on my shitty
computer. To make it faster I ended up computing the full git blame only for
commits spread out at least 3 weeks and also limited it to .c files:
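The two shortcuts are simple to sketch. This is a minimal illustration with made-up function names, assuming a greedy pick that walks through commit timestamps oldest first; the actual script may select commits differently.

```python
from datetime import datetime, timedelta


def sample_spread_out(commit_times, min_gap=timedelta(weeks=3)):
    """Pick commit times at least `min_gap` apart, oldest first."""
    picked = []
    for ts in sorted(commit_times):
        # Keep a commit only if it is far enough from the last one kept.
        if not picked or ts - picked[-1] >= min_gap:
            picked.append(ts)
    return picked


def only_c_files(paths):
    """Keep only paths ending in .c."""
    return [p for p in paths if p.endswith(".c")]
```

The full git blame then only needs to run at the sampled points, and only over the filtered file list.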
The squiggly lines are probably from the sampling mechanism. But look at this
beauty: a whopping 16M lines! The code contribution from each year’s cohort
is extremely smooth at this scale. Individual commits have absolutely no
meaning at this scale, but the cumulative sum of them is very predictable. It’s
like going from Newton’s laws to thermodynamics.
Linux also clearly exhibits more of a linear growth pattern. I’m speculating that
this has to do with its high modularity. The drivers directory has by far the
most files (22,091), followed by arch (17,967), which contains
support for various architectures. This is exactly the kind of thing you would
expect to scale very well with complexity, since they have a well-defined
interface.
Somewhat off topic, but I like the notion of how well a project scales with
complexity. Linear scalability is the ultimate goal, where each marginal
feature takes roughly the same amount of code. Bad projects scale
superlinearly, and every marginal feature takes more and more code.
It’s interesting to go back and contrast Linux to something like Angular, which
basically exhibits the opposite behavior:
The half-life of a randomly selected line in Angular is about 0.32 years. Does
this reflect on Angular? Is the architecture basically not as “linear” and
consistent? You might say the comparison is unfair, because Angular is new.
That’s a fair point. But I wouldn’t be surprised if it does reflect on some
questionable design. Don’t mean to be shitting on Angular here, but it’s an
interesting contrast.
Half-life by repository
A somewhat arbitrary sample of projects and their half-lives:
project        half-life (years)   first commit
angular        0.32                2014
bluebird       0.56                2013
kubernetes     0.59                2014
keras          0.69                2015
tensorflow     1.08                2015
express        1.23                2009
scikit-learn   1.29                2011
luigi          1.30                2012
backbone       1.48                2010
ansible        1.52                2012
react          1.66                2013
node           1.76                2009
underscore     1.97                2009
requests       2.10                2011
rails          2.43                2004
django         3.38                2005
theano         3.71                2008
numpy          4.15                2006
moment         4.54                2015
scipy          4.62                2007
tornado        4.80                2009
redis          5.20                2010
flask          5.22                2010
httpd          5.38                1999
git            6.04                2005
chef           6.18                2008
linux          6.60                2005
It’s interesting that moment has such a high half-life, but the reason is that so
much of the code is locale-specific. This creates a more linear scalability with a
stable core of code and linear additions over time. express is an outlier in the
other direction. It’s 7 years old but its code changes extremely quickly. I’m
guessing this is partly because of (a) a lack of linear scalability in the code and
(b) the fact that it’s probably one of the first major JavaScript open source
projects to hit mainstream popularity, surfing on the Node.js wave. Possibly the
code base also sucks, but I have no idea.
Has coding changed?
I can think of three reasons why there’s such a strong relationship between the
year the project was initiated and the half-life:
1. Code churns more early on in projects, and becomes more stable a while in
2. Coding has changed from 2006 to 2016, and modern projects evolve faster
3. There’s some kind of selection bias where the only projects that survive are
the scalable, stable ones
Interestingly, I don’t find any clear evidence of #1 in the data. The half-life of
code written early in old projects is as high as that of code written later. I’m
skeptical about #3 as well because I don’t see why there would be a relation
between survival and code structure (but maybe there is). My conclusion is that
writing code has fundamentally changed in the last 10 years. Code really seems to
change at a much faster rate in modern projects.
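As a rough check of the relationship between first-commit year and half-life, the numbers from the table above can be fed into a plain Pearson correlation. This is only a sketch: the choice of Pearson correlation is an assumption made for illustration, the sample is small and arbitrary, and a correlation says nothing about which of the three explanations is right.

```python
import math

# (first-commit year, half-life in years), copied from the table above.
TABLE = [
    (2014, 0.32), (2013, 0.56), (2014, 0.59), (2015, 0.69), (2015, 1.08),
    (2009, 1.23), (2011, 1.29), (2012, 1.30), (2010, 1.48), (2012, 1.52),
    (2013, 1.66), (2009, 1.76), (2009, 1.97), (2011, 2.10), (2004, 2.43),
    (2005, 3.38), (2008, 3.71), (2006, 4.15), (2015, 4.54), (2007, 4.62),
    (2009, 4.80), (2010, 5.20), (2010, 5.22), (1999, 5.38), (2005, 6.04),
    (2008, 6.18), (2005, 6.60),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


years = [year for year, _ in TABLE]
half_lives = [hl for _, hl in TABLE]
r = pearson(years, half_lives)
print(f"correlation between first-commit year and half-life: {r:.2f}")
```

A negative value means that projects started later tend to have shorter half-lives in this sample.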
By the way, see discussion on Hacker News and on Reddit!