
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

Measuring correlation between commit frequency and popularity on GitHub

JONATHAN JEFFORD-BAKER

MÅRTEN GRÖNLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Supervisor: Roberto Guanciale

Examiner: Örjan Ekeberg

Swedish title: Mätning av korrelation mellan commitfrekvens och popularitet på GitHub

Stockholm 2017

School of Computer Science and Communication, Kungliga Tekniska Högskolan

Abstract

This thesis studies the correlation between the commit frequency and popularity of Github projects. Over 12 000 projects were retrieved using the Github API, resulting in a dataset containing 85 projects after filtering out projects that were deemed unfit. The analysis of the projects consisted of calculating the Pearson Correlation Coefficient using the frequency of commits and popularity as variables. Different time intervals were studied along with several metrics of popularity based upon the project's metadata retrieved from Github. The results varied for the different time intervals and metrics of popularity, but none of the measurements resulted in a correlation coefficient which indicated a strong or moderate correlation. Therefore this study concluded that no correlation exists between commit frequency and popularity. Although no correlation was found, several potential improvements for further research were discovered.




Contents

1 Introduction
    1.1 Purpose
    1.2 Problem Statement
        1.2.1 Hypotheses
    1.3 Scope
    1.4 Outline

2 Background
    2.1 Github
        2.1.1 Github Terminology
    2.2 Open Source Development
    2.3 Correlation
        2.3.1 P-value
        2.3.2 Pearson Correlation Coefficient
        2.3.3 Spearman's Rank Correlation Coefficient
        2.3.4 Mann-Whitney U-test
        2.3.5 The Model Used in the Study
    2.4 Previous Research

3 Method
    3.1 Scraping Github
    3.2 Analysis
        3.2.1 Metrics and Limits
    3.3 Procedure

4 Results
    4.1 First Hypothesis
    4.2 Second Hypothesis
    4.3 Third Hypothesis
    4.4 Fourth Hypothesis
    4.5 Fifth Hypothesis

5 Discussion
    5.1 Result
        5.1.1 Correlation
        5.1.2 Abnormalities and Extreme Cases
    5.2 Possible Error Sources
        5.2.1 Source of Data
        5.2.2 Similarities between Metrics of Popularity
        5.2.3 Other Metrics of Popularity
        5.2.4 The Dataset
    5.3 Future Research

6 Conclusions


1 Introduction

Online code repositories have led to a new way of developing open source software by making the source code of various projects available over the internet and offering features such as bug tracking and task management. The availability of the source code allows people from all over the world to participate in different software projects and allows users to review the source code easily. Some online code repositories have implemented version control systems which supply metadata about the project, such as the changes that have been made, which user implemented each change, reported bugs, requested features and more.

The code may be used by anything from a couple of users to thousands, depending on how popular the project is. However, this metadata is seldom utilised, even though it can bring insight into different aspects of the project, for example how quickly bugs in the code are attended to. There exist some websites that try to visualise or in other ways analyse this data for research, but utilisation of this valuable resource is important and needs to be recognised to a greater extent.

Among the vast number of repositories reside a large number of open source projects, but which of these projects are actually worth investigating? From an investor's point of view, for example, this question is highly important: which project should they choose to invest time and possibly money in? A common reference for humans is popularity; being herd animals, humans tend to follow the herd, which is also the case when it comes to software. Users tend to explore and use software that is popular, so a popular software project would be a good candidate for the investor.

Popularity is not easy to measure, though, especially in the case of open source software (OSS) projects. Some projects are small, with a few contributors and users, whilst other projects have thousands of contributors and hundreds of thousands of users. There also exist projects with few contributors but immense usage, an example being OpenSSL, which has 15 developers and is used on millions of websites¹,². It would therefore be inaccurate to measure popularity using the number of contributors as a metric, since the software could still be widely used.

1.1 Purpose

By analysing the metadata extracted from a large number of code repositories, one can develop a general picture of the development process and detect patterns in it which have positive or negative consequences for the project's outcome. Detecting these similarities or differences between projects could perhaps result in conclusions which benefit the evolution of models for software development.

The purpose of this study is to extract the metadata from a large quantity of repositories and, based on that data, determine its effects on the software project. The focus will lie on a data type called “commits”, which essentially are the changes that have been made to the project and which occur in repositories that use the Git version control system.

¹ https://www.openssl.org/community/team.html
² https://trends.builtwith.com/Server/OpenSSL


1.2 Problem Statement

Commits provide valuable information about the project since together they constitute the complete history of changes made to the code. For a project to progress, the code has to evolve; in other words, commits have to be submitted for the project to move forward.

There exists a huge number of open source software projects, but far from all of them are popular; in fact, many projects die out after a short period of time. Earlier studies have investigated what makes an open source project popular. Aggarwal et al. [1] focused on the correlation between documentation evolution and popularity, and Bissyandé et al. [2] studied the connection between reported issues and project success. However, there has not been a study where the correlation between commits and project popularity was investigated, and since many projects are unpopular it is interesting to study whether the frequency of commits in certain periods of a project's lifecycle has an impact on its popularity. Therefore the problem statement for this study will be:

How does the frequency of commits over a project's lifetime correlate to its popularity?

1.2.1 Hypotheses

The beginning and end of a project might give insight into how the project became popular or how well the project has kept its popularity until the end (or current state). A project that has high initial activity might interest others to contribute and build a userbase quickly. A project that also has high activity during the end of the project (or is currently active) shows that the userbase is large enough for continuous development.

Hypothesis 1 - A high frequency of commits in the beginning of a project will give it a good start, and the project will therefore gain popularity throughout its lifetime.

Hypothesis 2 - A high commit frequency at the end of a project implies that the project is (or was) very active, and thus popular as well.

A high average commit frequency throughout the project would indicate a project that is active overall, even though the activity may fluctuate between different periods, and would thereby result in a popular project.

Hypothesis 3 - A high average frequency of commits throughout the project should result in high popularity.

A project with a long period of continuous weekly activity is believed to be potentially popular, since such a period would indicate consistent interest in the project. In contrast, a project with a long period without any commit activity is believed to be unpopular.

Hypothesis 4 - A high number of weeks in a row containing commits implies a popular project.

Hypothesis 5 - A high number of weeks in a row containing no commit activity implies an unpopular project.

1.3 Scope

There are several online repository services available; however, this study will focus on Github. There are different types of repositories at Github. The Public repository is available for anyone to browse and is the most commonly used since it is free of charge, unlike the others. Not all types of repositories are publicly available: the Personal repository and the Enterprise repositories are private and only accessible if the owner has allowed access. This study will therefore only involve public repositories. So-called “forks”, which are repositories that have been cloned from the original project but developed independently, will be disregarded, and only the original projects will be studied. Another limitation concerns the size of a project: any project repository of a size below 500 bytes will be excluded from the dataset. The programming language may affect the popularity of a project because of the popularity of the language itself; this study will not consider this relation, and all projects will be treated equally in terms of programming language.

1.4 Outline

Section 2 provides an introduction to Github and the development of open source software. Furthermore, the section explains the principles of correlation and discusses several correlation models. Additionally, the section introduces the previous research in the same area as this study. Section 3 discusses and motivates the choice of dataset, metrics and time intervals as well as the procedure of analysing the data. The results are presented in section 4 and discussed in section 5, including possible sources of error along with considerations for future research. Lastly, the conclusions are presented in section 6.

2 Background

This section provides the background necessary to understand the research methodology in this study. Section 2.1 gives an overview of Github and related terminology used throughout the report. Open source development is discussed in section 2.2. An overview of correlation, along with models for measuring it, is provided in section 2.3. Lastly, previous research relevant to this study is presented in section 2.4.

2.1 Github

Github is an online code repository for software development projects which has become immensely popular over time compared to similar platforms such as SourceForge and Google Code. This popularity is linked to the features Github offers, such as the ability to fork a project, issue tracking and the ability to watch projects [1]. In other words, Github offers a platform which eases the social interaction between participants in the development process, something that SourceForge, for example, lacked when Github started off, and which caused Github to surpass SourceForge [3].

2.1.1 Github Terminology

Fork - Forking a project essentially means that a developer creates their own version of the project which can become fully independent of the original.

Issue tracking system - A feature that enables users to report and discuss issues, which can be, for example, bug reports or suggestions for new features to be added to the project.

Watching - When a user watches a project they are subscribed to receive notifications whenever changes are made in the project, such as when new pull requests and new issues are submitted. Due to updates made by Github, Watchers are named Subscribers in the API; hence the latter term will be used in this thesis.

Pull request - Takes place when a developer has made changes in the code and wishes to submit them to the main project. This request is then reviewed by the core developers and, if granted, the change is made in the code of the main project.

Contributor - A user who has contributed to the project by submitting at least one commit at some time during the project's course.

Star - The star feature allows a user to bookmark a repository for easier access and at the same time show appreciation to the maintainer of the repository.

Repository - As found in the study by Kalliamvakou et al., a repository need not necessarily be a project; therefore a distinction between the two should be made [4]. A repository denotes an arbitrary repository (either a base repository or a fork) while a project denotes the base repository with all its forks.

Commit - A commit to a repository is any change to the repository that either adds a new file, deletes a file, or changes file contents or file structure. This change will be recorded automatically by Git, and in order for the change to be pushed to Github, a commit has to be made. The commit also includes a commit message that should describe what changes have been made, as well as other metadata such as the author of the commit, a timestamp, a checksum etc.³

2.2 Open Source Development

The development process for open source projects hosted at online repositories such as Github often differs from more traditional ways of creating software. The developers in Github projects are often people who develop the software during their spare time, more as a hobby than as a job [5]. In contrast to work-related software development, the developers are in other words not committed full time to the project. Furthermore, the developers are decentralised, contributing from different geographical locations, communicating and meeting over the internet instead of meeting in person.

In contrast to traditional software development, the team of developers is not limited to a specific group of people: anyone who wishes can contribute to a project on Github as long as the repository is public. Another difference is that there often is a lack of deadlines; software is developed “on the go” and the tempo at which progress is made varies over time [5].

2.3 Correlation

Correlation is a measure in mathematical statistics of the dependence between two or more variables. The variables can also be referred to as observations, and the two terms will be used synonymously in this text. In this study a correlation between commit frequency and repository popularity would indicate that there exists a relationship between different frequencies of commits and the popularity of a repository. It is important to distinguish between correlation and causality, because although a result might indicate a correlation between variables, it does not imply that there exists a causal relationship between them. A causal relationship between variables is a situation where one or several variables cause the outcome of the others; for example, there exists a causal relationship between the age of a child and its height. Correlation can indicate the existence of a potential causal relationship, but it does not say anything about the underlying reason for the relationship. By measuring the correlation, this study will therefore only be able to answer whether there exists a relationship between the two variables mentioned, and not what causes it.

2.3.1 P-value

An interesting value to study, besides the result of a statistical model (discussed below), is the p-value. The p-value is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. In other words, it indicates whether the result given by the model could be the outcome of a highly unlikely event. The null hypothesis in terms of this study is that no correlation between the variables exists.

³ https://developer.github.com/v3/repos/commits/

When the p-value is lower than the selected significance level, typically 0.05 (5%), the null hypothesis should be rejected. A higher p-value indicates that the null hypothesis cannot be rejected. The significance level used in this study will be 0.05.

2.3.2 Pearson Correlation Coefficient

Correlation is commonly used for linear relationships and can be measured with models that approximate a function. One common model is Pearson's product-moment coefficient, which measures how closely the data points follow a best-fit line. This model is only useful for two variables x and y that are on either an interval or a ratio scale [6]. The value of the Pearson Correlation Coefficient (PCC) ranges between -1 and 1; the closer the value is to either of the two, the stronger the correlation. The PCC is computed as follows:

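\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
\]

Where $r$ denotes the PCC, $x_i$ and $y_i$ are members of the datasets $\{x_1, x_2, \ldots, x_n\}$ and $\{y_1, y_2, \ldots, y_n\}$ respectively, and $\bar{x}$ and $\bar{y}$ are the averages of all $x$ and $y$ respectively.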
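The formula can be evaluated directly, term by term. The following is a minimal sketch in plain Python and not the analysis program used in this study (which relied on SciPy); the function name `pearson_r` and the sample values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return numerator / (math.sqrt(ss_x) * math.sqrt(ss_y))


# Hypothetical data: commits per day and star counts for five projects.
commits_per_day = [0.2, 0.5, 1.0, 1.5, 3.0]
stars = [12, 40, 15, 90, 30]
r = pearson_r(commits_per_day, stars)
```

A perfectly linear increasing relationship such as `pearson_r([1, 2, 3], [2, 4, 6])` gives a value of 1 (up to floating-point rounding), and a perfectly linear decreasing one gives -1.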

There are guidelines with intervals for interpreting the resulting coefficient, but it is important to note that these intervals depend on the dataset measured [6]. For some datasets a coefficient of 0.5 could be a strong positive correlation, for example in the social sciences, where many complicated factors could influence the result [7]. In other datasets, for example calculations with physical laws measured with very accurate tools, a coefficient of 0.5 would indicate a very low correlation.

2.3.3 Spearman’s Rank Correlation Coefficient

Spearman's Rank Correlation Coefficient is very similar to the Pearson Correlation Coefficient but instead uses the ranked values of the variables; it assumes the existence of a monotonic relation between the variables and evaluates this relation [8]. A ranked value is computed by comparing all the values of a variable and ranking them by their numerical value. A monotonic relation exists between two variables if both variables increase together or if, when one increases, the other decreases. The Spearman Coefficient is computed using the formula:

\[
r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
\]

Where $n$ represents the number of observations and $d_i = \operatorname{rg}(x_i) - \operatorname{rg}(y_i)$ denotes the difference between the two ranks of each observation.
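As an illustration of the formula, the sketch below ranks two variables and applies it. It assumes that no values are tied, since this simplified formula only holds in that case; the helper names `ranks` and `spearman_rs` are hypothetical and not taken from the study.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes all values are distinct."""
    if len(set(values)) != len(values):
        raise ValueError("this simplified formula assumes no tied values")
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]


def spearman_rs(xs, ys):
    """Spearman coefficient via r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    # d_i is the difference between the rank of x_i and the rank of y_i.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Any strictly increasing relation yields 1.0, regardless of the gaps in size.
rs_increasing = spearman_rs([1, 2, 3, 4], [10, 20, 30, 2000])  # 1.0
rs_decreasing = spearman_rs([1, 2, 3, 4], [4, 3, 2, 1])        # -1.0
```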

2.3.4 Mann-Whitney U-test

Another ranked correlation test is the Mann-Whitney U-test, which assigns ranks to the different observations. The U value is computed by first assigning ranks to all observations, where rank 1 denotes the smallest value. If two or more observations share the same value, instead of being assigned different ranks depending on order of occurrence, they are all given a rank equal to the average of the ranks they would otherwise receive. Secondly, the ranks for sample 1 $(x_0, x_1, \ldots, x_n)$ are summed, and the U value is then given by the equation:

\[
U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}
\]

Where $n_1$ is the sample size for sample 1 and $R_1$ denotes the sum of ranks in sample 1. $U_2$ is then calculated using the same formula for sample 2 $(y_0, y_1, \ldots, y_n)$, and the smaller of $U_1$ and $U_2$ is then used for comparison in a statistical significance table.
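The two steps described above (ranking the pooled observations with averaged ranks for ties, then summing the ranks per sample) can be sketched as follows. This is an illustrative plain-Python version with hypothetical function names; looking up the result in a significance table is left out.

```python
def average_ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, min(U1, U2)) for two samples."""
    n1, n2 = len(sample1), len(sample2)
    pooled_ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(pooled_ranks[:n1])  # sum of ranks belonging to sample 1
    r2 = sum(pooled_ranks[n1:])  # sum of ranks belonging to sample 2
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return u1, u2, min(u1, u2)


# Completely separated samples give the most extreme result: U1 = 0.
u1, u2, u = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

A useful sanity check is that $U_1 + U_2$ always equals $n_1 n_2$, here 0 + 9 = 3 * 3.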

Like the previous models, the Mann-Whitney U-test does not require an assumption of a normal distribution of the variables, but it requires that the distribution under the null hypothesis is known. The null hypothesis states that it is equally probable that a value selected at random from one sample will be less than or greater than a value selected at random from a second sample [9].

2.3.5 The Model Used in the Study

Ranked models usually fit very well in studies where the data takes values that are roughly on the same scale. If the majority of the values are close to each other but some deviate significantly, the ranking of the variables will be misleading. For example, a dataset containing the values {41, 42, 43, 2000} would receive the ranks {1, 2, 3, 4}, and a dataset containing the values {41, 42, 43, 44} would receive the same ranking {1, 2, 3, 4}, even though the values 44 and 2000 differ a lot from each other. As mentioned previously, there exist some projects with a massive userbase, such as OpenSSL on Github, but most projects are smaller. The popularity was therefore believed to range over a wide scale with the majority of projects not very popular, which would increase the risk of a scenario similar to the one mentioned above. Therefore the two ranked models above will not be used in this project, leaving the Pearson Correlation Coefficient as the remaining model of the three. The PCC is widely used, does not require any ranking of the data, and will therefore be used in this study.
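The ranking example from the paragraph above can be reproduced in a few lines. This is only a small illustration with a hypothetical helper `rank_values`; it assumes distinct values.

```python
def rank_values(values):
    """Return the rank (1 = smallest) of each value; assumes distinct values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


with_outlier = [41, 42, 43, 2000]
without_outlier = [41, 42, 43, 44]

ranks_a = rank_values(with_outlier)     # [1, 2, 3, 4]
ranks_b = rank_values(without_outlier)  # [1, 2, 3, 4]
same_ranks = ranks_a == ranks_b         # True: ranking hides the size of the gap

# The raw values keep the information that the ranking discards.
gap_a = with_outlier[-1] - with_outlier[-2]        # 1957
gap_b = without_outlier[-1] - without_outlier[-2]  # 1
```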

2.4 Previous Research

There have been several studies investigating common features and correlations of OSS projects regarding active contributors, programming languages or structure, commits and other contributions [1, 2, 10, 11]. Many of the most recent studies focus mainly on Github development or on datasets originating from the platform, such as stars, forks and issues. Some of the studies also include code or detailed descriptions of how the Github data was acquired [1, 10, 12]. Two of the mentioned studies used the Github v3 API⁴ to extract the dataset [10, 12] while one of the studies used the GHTorrent⁵ dataset [1]. The study by Peterson [12] also included an algorithm for selecting a repository at random. The selection was done by choosing a random word from an arbitrary word list and passing it to the Github API, which in turn returns a list of repositories containing the selected word in their descriptions. Finally, a repository in that list is selected at random and its contents extracted.

⁴ https://developer.github.com/v3
⁵ www.ghtorrent.org
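The random selection procedure attributed to Peterson [12] can be sketched as below. This is not Peterson's code: the `search` argument is a hypothetical callable standing in for a request to a repository search endpoint, and the tiny in-memory index is made-up data so that the sketch runs offline.

```python
import random


def pick_random_repository(word_list, search, rng=random):
    """Pick a repository: random word, search for it, then a random hit.

    `search` is a hypothetical callable that takes a word and returns a list
    of repositories whose descriptions contain that word.
    """
    word = rng.choice(word_list)
    candidates = search(word)
    if not candidates:
        return None  # nothing matched this word; the caller may simply retry
    return rng.choice(candidates)


# Stand-in search backend with invented repository names.
FAKE_INDEX = {
    "parser": ["alice/json-parser", "bob/ini-parser"],
    "game": ["carol/tetris-game"],
    "empty": [],
}


def fake_search(word):
    return FAKE_INDEX.get(word, [])


repo = pick_random_repository(list(FAKE_INDEX), fake_search, random.Random(0))
```

Note that a selection of this kind is only uniform over the repositories reachable through the word list, which is one reason a different sampling method (random repository ids) may be preferred.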

A study by Aggarwal et al. focused on what part documentation plays in a project's popularity [1]. Aggarwal et al. extracted commits, forks, watchers and pull requests and used these features of a repository to measure how popular it was. The study measured how much the documentation changed over the lifetime of a project and used cross correlation to investigate how this was related to the popularity of the project. The metric for popularity was defined by the formula Popularity = Stars + Forks + Pulls². Unfortunately, the model used for computing the cross correlation is not described in detail. The study found an apparent relationship between the popularity of a project and extensive (and consistent) documentation.

Studies by Kalliamvakou et al. [4] show that the majority of all Github repositories are personal and inactive, which may have a significant impact on what conclusions can be drawn from a dataset of Github repositories. The conclusions were reached by analysing parts of the GHTorrent dataset and sending out surveys to GitHub users. It was also found that most projects have very few commits, which should be kept in mind when analysing commits on GitHub.

Weicheng et al. [11] measured the relation between the frequency of commits and the evolution of file versions in eight large and successful projects on Github. The conclusion was that commits which involved at least five changed lines of code in at least five different files occurred on average every 5.34 days, which was around three times less frequent than commits of lesser size. The frequency was related to the files in which changes were made, and small commits, in particular in core project files, resulted in large changes of code in following commits.


3 Method

3.1 Scraping Github

Collecting data from Github was an essential part of the project. There are several different methods of collecting data from Github that would yield the necessary dataset for this project, each with advantages and drawbacks. By comparing the methods used in previous research (see section 2.4) we found that using the Github API would give us the largest quantity of useful data while still being completely free. This alternative was chosen instead of using Github Archive⁶ along with Google BigQuery⁷ due to the simplicity of using the API and the lower quantities of data it would imply, in contrast with Github Archive, which would require terabytes of storage.

Using a script⁸ written in Python, we collected repositories and their respective commits by generating a random repository id and passing it as a parameter when querying the Github API for a list of repositories. This list is represented in JSON format, where every repository is an object within the list. After getting a repository list we filtered out all repositories that were forks, in other words extracting only projects, and also filtered out repositories smaller than 500 bytes. We also removed some key:value pairs in the repository object that would not be used in the project, in order to roughly remove most useless data (further reasoning about filtered-out data can be found in section 3.2.1). For each remaining repository object in the list, all associated commits were then downloaded and saved as a JSON list within the repository object called “commit”.
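The filtering step described above could look roughly like the following sketch. It is not the published scraper: the field names `fork` and `size`, the set of retained keys and the function name are assumptions made for illustration, and the unit in which the API reports `size` should be checked against the API documentation before reusing the threshold.

```python
MIN_SIZE = 500  # threshold from the text; the unit of the API field is an assumption

# Hypothetical subset of keys kept for the later analysis.
KEEP_KEYS = {"id", "full_name", "size", "stargazers_count",
             "forks_count", "subscribers_count", "created_at"}


def filter_repositories(repos, min_size=MIN_SIZE):
    """Drop forks and very small repositories, and strip unused keys."""
    kept = []
    for repo in repos:
        if repo.get("fork"):
            continue  # only base repositories (projects) are studied
        if repo.get("size", 0) < min_size:
            continue  # too small to be of interest
        slim = {k: v for k, v in repo.items() if k in KEEP_KEYS}
        slim["commit"] = []  # placeholder for the commit list fetched later
        kept.append(slim)
    return kept


# Made-up repository objects for demonstration.
sample = [
    {"id": 1, "full_name": "alice/tool", "fork": False, "size": 1200,
     "description": "not needed later"},
    {"id": 2, "full_name": "bob/tool", "fork": True, "size": 1200},
    {"id": 3, "full_name": "carol/empty", "fork": False, "size": 100},
]
projects = filter_repositories(sample)
```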

The data collection ran in parallel on two computers for an effective time of about 11 days. Since the API restricts the number of requests per hour to 5,000, the total number of projects ended up at 12,374 with an accompanying 1,246,462 commits. Due to an inconsistency⁹ in the Github API and the way that the script crawled through the commits, some projects ended up having exactly 30 commits in our dataset even though there were more on the website. This led to the development of another script which collected the remaining commits by going through all projects once again. After this procedure the final total number of commits was 1,299,235, over 50,000 more than originally.

⁶ https://www.githubarchive.org/
⁷ https://cloud.google.com/bigquery/
⁸ Available at https://github.com/martengooz/github-scraper
⁹ Some API requests for certain repositories would not respond with the header pair “Link: [string value]” used for page traversing, as explained at https://developer.github.com/guides/traversing-with-pagination/. This made the script believe there were no more pages (i.e. commits) to request and therefore terminate after collecting only the default 30 commits on the first page.
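To make the pagination issue concrete, the sketch below extracts the `rel="next"` URL from a `Link` header value and flags repositories whose commit list may have been cut off after the first page. The function names are hypothetical and the example URLs are invented; the check for "exactly 30 commits and no next link" mirrors the symptom described above rather than the actual repair script.

```python
import re

# Matches entries of the form <URL>; rel="name" inside a Link header value.
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the URL tagged rel="next", or None if there is no such entry."""
    if not link_header:
        return None  # header missing: looks like the last page
    for url, rel in LINK_ENTRY.findall(link_header):
        if rel == "next":
            return url
    return None


def looks_truncated(commit_count, link_header, per_page=30):
    """Heuristic: a full first page without a next link deserves a re-check."""
    return commit_count == per_page and next_page_url(link_header) is None


header = ('<https://api.example.invalid/repos/1/commits?page=2>; rel="next", '
          '<https://api.example.invalid/repos/1/commits?page=5>; rel="last"')
following = next_page_url(header)
```

A repository with exactly 30 stored commits and no `Link` header is ambiguous: it may genuinely have 30 commits, so a second request is needed to tell the two cases apart.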


3.2 Analysis

3.2.1 Metrics and Limits

Several metrics of popularity were used, each consisting of metadata extracted from the Github API. The metrics used to define popularity are listed below:

Stars

The number of stars was one of the metrics used and was chosen primarily because Github itself uses it to rank the currently trending repositories¹⁰. This metric seems to be a good starting point and a general baseline for how popularity can be measured.

Forks

The number of forks of a project was chosen as a metric because a high number of forks could indicate a popular software project: many people wish to modify the source code and create their own version. This value is, however, generally much lower than the number of stars.

Stars + Forks + Subscribers

This metric includes the developer aspect by using forks as one of the combined metrics. Forking a project is an action made by a developer since it involves interaction with the project's code. Starring or subscribing to a project can be an action carried out by either a project developer or a user; by including forks, the developer aspect is therefore combined with the user aspect. This sum will henceforth be denoted SFS for shorter referencing.

The formula used in Aggarwal et al. [1] that was mentioned in section 2.4 was not used due to difficulties in retrieving the number of pulls from the API. Another problem with using the number of pulls as a metric of popularity is that only a small part of the projects on Github use pull requests [4]. Further discussion regarding issues with different metrics can be found in section 5.2.2.

The limits for the different variables which a project had to fulfil in order to be included in the dataset used in the analysis were the following:

• Stars > 10

• Subscribers > 10

• Forks > 1

• Commits > 100

After weeding out the projects that did not fulfil the above limits only 85 remained, a loss of roughly 99.3% of the originally downloaded projects. As seen in section 5, there exist several data points deviating significantly from the others. Due to the very low number of projects remaining, the choice was made to keep the deviating projects to avoid further reduction of the dataset.
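The limits listed above and the resulting loss can be expressed compactly. The sketch below uses invented project records and hypothetical key names; only the thresholds and the two counts (12,374 downloaded, 85 remaining) come from the text.

```python
# Each metric must be strictly greater than its limit.
LIMITS = {"stars": 10, "subscribers": 10, "forks": 1, "commits": 100}


def meets_limits(project, limits=LIMITS):
    """True if the project exceeds every limit."""
    return all(project[name] > minimum for name, minimum in limits.items())


# Made-up projects for demonstration.
candidates = [
    {"stars": 250, "subscribers": 40, "forks": 12, "commits": 900},
    {"stars": 10, "subscribers": 40, "forks": 12, "commits": 900},  # stars not > 10
    {"stars": 50, "subscribers": 11, "forks": 2, "commits": 100},   # commits not > 100
]
kept = [p for p in candidates if meets_limits(p)]

# Share of projects lost, using the counts reported in the text.
downloaded, remaining = 12_374, 85
loss_percent = 100 * (1 - remaining / downloaded)  # about 99.31
```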

¹⁰ https://github.com/trending

3.3 Procedure

The data was analysed using a program written in Python, using the SciPy package for measuring the correlation. In order to get the full picture of the data correlation, different time intervals of the projects were studied along with various metrics for measuring the popularity. The first time periods studied were the first and last month of a project. The commits in these intervals were counted and the average number of commits per day was calculated. The next time period studied was the entire lifetime of a project, where the average commit frequency per hour was measured. Lastly, the weekly activity of projects was studied by measuring the highest number of weeks in a row in which commits were made and the highest number of weeks in a row containing no commit activity at all.
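The measurements described above can be derived from a project's commit timestamps as in the following sketch. It is an illustration under stated assumptions rather than the analysis program itself: a month is taken to be 30 days, weeks are counted as 7-day windows starting at the first commit, and all function names and the sample history are made up.

```python
from datetime import datetime, timedelta


def first_month_rate(timestamps, days=30):
    """Average commits per day during the first `days` days of the project."""
    end = min(timestamps) + timedelta(days=days)
    return sum(t < end for t in timestamps) / days


def last_month_rate(timestamps, days=30):
    """Average commits per day during the last `days` days of the project."""
    start = max(timestamps) - timedelta(days=days)
    return sum(t > start for t in timestamps) / days


def longest_streaks(timestamps):
    """Longest runs of consecutive weeks with and without any commit."""
    first, last = min(timestamps), max(timestamps)
    active = {(t - first).days // 7 for t in timestamps}
    longest_active = longest_idle = run_active = run_idle = 0
    for week in range((last - first).days // 7 + 1):
        if week in active:
            run_active, run_idle = run_active + 1, 0
        else:
            run_active, run_idle = 0, run_idle + 1
        longest_active = max(longest_active, run_active)
        longest_idle = max(longest_idle, run_idle)
    return longest_active, longest_idle


# Made-up history: commits on days 0, 1, 8, 15, 50, 51 and 58 of a project.
origin = datetime(2016, 1, 1)
history = [origin + timedelta(days=d) for d in (0, 1, 8, 15, 50, 51, 58)]

first_rate = first_month_rate(history)  # 4 commits / 30 days
last_rate = last_month_rate(history)    # 3 commits / 30 days
streaks = longest_streaks(history)      # (3, 4)
```

In the sample history, weeks 0 to 2 contain commits, weeks 3 to 6 are idle and weeks 7 to 8 are active again, which gives a longest active streak of 3 weeks and a longest idle streak of 4 weeks.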


4 Results

The results shown below are split into five parts connected to the hypotheses described in section 1.2.1. Each part consists of three plots, one for each metric (stars, forks, SFS), except for the fourth hypothesis, which includes six plots. Each point in the graphs represents a single project, and the line represents the best-fit line, shown with the Pearson Correlation Coefficient, denoted r, together with p, the p-value for the model.

4.1 First Hypothesis

The first month of a project is where commits are most frequent, at an overall average of 1.74 commits/day. It is noticeable that the majority of the projects are located near the bottom of each graph, meaning that most projects have a low popularity score. The projects are also distributed mainly around the lower left corner, indicating that there are few commits per day. The PCC is positive for each metric, measuring 0.0479, 0.1396 and 0.1164 in the forks, stars and SFS case respectively. The p-value is considerably higher when measuring popularity solely by forks, at 0.6632 compared to 0.2026 (stars) and 0.2888 (SFS).

Figure 1: Average daily commits first month. Panels: (a) SFS, (b) Stars, (c) Forks.


4.2 Second Hypothesis

The last month shows how much the end or current state of a project relates to its popularity. In this instance the majority of the projects are also located near the bottom left of each graph. The average commit frequency is roughly a fifth of that in the first month, at 0.39 commits/day. The PCC is slightly positive for each metric. When measuring popularity by the number of forks the PCC is 0.2112 with a p-value of 0.0524, compared to the first month where forks had a lower PCC of 0.0479 and a higher p-value of 0.6632. The values for the other graphs stay within a 0.1 margin of the previous result, with the exception of the SFS p-value which decreased from 0.2888 to 0.1225. There is also a noticeable vertical line at exactly one commit per day consisting of 8 projects in total.

Figure 2: Average daily commits latest month. Panels: (a) SFS, (b) Stars, (c) Forks.

4.3 Third Hypothesis

When comparing the average commit frequency over the whole lifetime of the project we again see that most values are centered near the bottom left. The average commit frequency across all projects is 0.75. The results are similar across the metrics used for popularity: the PCC has a minor positive value between 0.0112 and 0.0659, and the p-value is close to 1 for both stars and SFS.


Figure 3: Average daily commits in the repositories' lifetime. Panels: (a) SFS, (b) Stars, (c) Forks.

4.4 Fourth Hypothesis

The longest streak of weeks with at least one commit each week proved to be a special case where a single project made a significant difference to the PCC when included. For this reason we provide two graphs for each metric, one with all projects included (the left column) and one with the project removed (the right column). The removed project is named “broadgsa/gatk” and can be seen rightmost in the images in the left column. The removal of the project more than doubled the PCC in each case and the p-value decreased by 0.3483 on average, which resulted in both SFS and forks having a p-value under the 0.05 significance level threshold.


Figure 4: Weekly commit activity streak. Panels: (a) SFS, (b) SFS without extreme case, (c) Stars, (d) Stars without extreme case, (e) Forks, (f) Forks without extreme case.


4.5 Fifth Hypothesis

When measuring the number of consecutive weeks of inactivity, the PCC is slightly negative at approximately -0.08 and the p-value is 0.48 ± 0.015 for all cases. Overall the most popular projects are within the 0-60 weeks range of inactivity, and only eight projects have had over 2 years of inactivity before another commit was pushed.
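The two weekly measures used for the fourth and fifth hypotheses could be derived from commit timestamps roughly as follows. This is a sketch under an assumption: weeks are taken as consecutive 7-day bins counted from the first commit, which the text does not specify, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def weekly_streaks(timestamps):
    """Return (longest active streak, longest inactive streak) in weeks."""
    start = min(timestamps)
    # Index of the 7-day bin each commit falls into, counted from the first commit.
    active_weeks = sorted({(t - start).days // 7 for t in timestamps})

    longest_active = current = 1
    longest_inactive = 0
    for prev, week in zip(active_weeks, active_weeks[1:]):
        gap = week - prev - 1  # number of empty weeks between two active weeks
        if gap == 0:
            current += 1
        else:
            longest_inactive = max(longest_inactive, gap)
            current = 1
        longest_active = max(longest_active, current)
    return longest_active, longest_inactive


# Made-up project: commits in weeks 0, 1 and 2, then again in weeks 7 and 8.
first = datetime(2017, 1, 1)
example = [first + timedelta(days=d) for d in (0, 7, 14, 50, 58)]
print(weekly_streaks(example))  # (3, 4)
```

Since the lifetime ends at the last commit, inactivity is only counted between commits, never after the final one.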

Figure 5: Weekly inactivity streak. Panels: (a) SFS, (b) Stars, (c) Forks.


5 Discussion

The result was unexpected: such a weak correlation coefficient between the commit frequency in different time intervals and the different metrics of popularity was beforehand deemed highly unlikely. There are, however, certain aspects that might have contributed to this result. These involve the source of data (GitHub in this case), the metrics used to measure popularity and the extracted dataset itself.

5.1 Result

The result is not as distinct as we hoped, but it does follow the expectations of a study of this character. Most of the projects are neither popular nor have high values in any of the other variables measured in this report, which is in line with the conclusions of Kalliamvakou et al. [4].

5.1.1 Correlation

An overall convincing correlation cannot be found in any of the cases, since the values are distributed in such a way that a high value in popularity does not always correspond to a high frequency during that period. This is also reflected by the PCC, which does not exceed the absolute value of 0.3 except on one occasion; that case, given its low p-value, could indicate a weak correlation according to the standards of Cohen (1988) [13]. However, all coefficients are slightly positive where the hypotheses expected them to be, and for the fifth hypothesis the coefficient is negative, which also agrees with the hypothesis.

The absolute weakest correlation coefficient was found when examining hypothesis 3, measuring the correlation between popularity and the average commit frequency throughout the project’s lifetime. The projects are randomly chosen: some have been stable for years with only a few regular commits each month for maintenance, giving a lower average frequency compared to when they were new, whilst others are just in the starting period of development and have not yet reached their full potential in popularity. This makes for an unfair comparison between projects of different length, which could be why the PCC is so low. Another reason could be that projects with a high commit frequency but low popularity are company or industry related projects that are not used widely outside that group but are still being actively developed.

The biggest difference between the results for each hypothesis occurred when the number of forks determined a project’s popularity, which in some cases had a noticeably higher PCC and in one case a much lower one. This raises the question of whether the number of forks was a good metric to describe popularity on GitHub, or whether stars and SFS gave a false picture. This topic is discussed in section 5.2.2. The number of forks nevertheless ended up having the highest PCC value (0.3480) and lowest p-value (0.0012), which could mean a weak correlation according to Cohen (1988) [13]. This was, however, after a data point was removed as described in section 4.4, and with a larger dataset the values could either increase or decrease. Although the higher value might not be accurate in the real world, it is an indication of what could be closer to the real value if the removed project is an exception. For this study, though, this will not be considered a weak correlation due to the circumstances of having a very small dataset.

5.1.2 Abnormalities and Extreme Cases

There are some extreme cases in the dataset: some correspond to our hypotheses, while others indicate the opposite by showing no correlation. For the fourth hypothesis a project was removed in order to show the magnitude of difference that a single project could make. Leaving the other deviating projects in the dataset has probably also affected the correlation coefficient negatively, but on the other hand left more projects to study in the dataset.

In section 4.2 there is a noticeable vertical line at exactly 1 commit per day, which seemed unnatural at first sight since the next higher value is located at 2.2 on the x-axis. A possible explanation for this phenomenon is that these projects follow a policy of at least one commit per day in order to maintain development. Other alternatives are that the dataset became corrupt somewhere or that the data analysis software has a bug. Although only 8 projects demonstrate this behaviour, it is quite a peculiar result.

Taking all this into account, the overall results show no direct correlation between commit frequency and popularity. However, the results indicate that although no direct correlation can be found, many projects follow hypotheses 1, 2, 4 and 5, with some exceptions showing the opposite. The conclusion would therefore be that many popular projects follow the pattern that the hypotheses describe, but following them is not a recipe for success, at least not with the metrics used in this study.

5.2 Possible Error Sources

5.2.1 Source of Data

Being the biggest platform for code repositories, GitHub was chosen as the source of data for this study. Using GitHub as the source of data has its advantages and disadvantages. Due to its position as the most popular platform it offers a vast number of repositories, which allows building large datasets to work with. However, the adequacy of the repositories as test data varies to a large extent: a large number of the repositories retrieved initially were deemed unfit for the dataset due to their small number of commits, in accordance with the study by Kalliamvakou et al. [4]. This problem was attended to by screening out the repositories that did not exceed 100 commits, but this is an arbitrary limit which was decided upon to filter out the worst repositories. 100 commits is a relatively small number of commits to study, and the small number of data points may therefore give an inaccurate distribution of commit frequency in the different intervals. An example of this can be seen in section 4.4.

Kalliamvakou et al. [4] also stated that most projects are inactive and suggested filtering out projects with no recent commits or pull requests. Unfortunately there was no time to implement such filtering, as it would require further analysis of the dataset to define a time period which would classify a project as inactive. Inactive projects should not have a major effect on the result in this study, since the project lifetime is defined as the period ranging from the time of the first commit to the last commit. Another peril with GitHub mentioned by Kalliamvakou et al. [4] is the large number of repositories that are not used for software development. The study suggested analysis of the README file and the description as a solution to this eventuality. Initially, during the development of the scraper software, there were plans for including code for Natural Language Processing to analyse the description and README. This would have increased the complexity of the software even more, and the plans were abandoned when the corrupt commit data was discovered, which resulted in a lack of time to implement such features. As such, the quality of the dataset could have been affected by the existence of projects unfit for this study, which would have a negative impact on the results.

GitHub Archive could have been used as the source of data instead, which would provide additional interesting information such as when a project gained a star or fork. This would have allowed metrics such as the rate of change in popularity between commits to be used. The downside of using GitHub Archive is the vast increase of data which would have to be processed: the activity of one year makes up around 600GB, compared to the 200MB file which was used in this study.

5.2.2 Similarities between Metrics of Popularity

Popularity is difficult to define. Even though GitHub offers plenty of metadata through its API, it is still far from easy to determine which data represents a project’s popularity. Studying the results, one can argue that the choice of metric for measuring popularity has very low significance, as the results are very similar to each other. Had there been a noticeable difference between the results it could have constituted grounds for arguing that one metric may be better than the other, but as that is not the case this study will not claim any of the chosen metrics to be better than the others. The underlying reason for the very similar results is believed to be rooted in the fact that projects gain very few stars, forks and subscribers in general, which makes these metrics unfit for the task of this study (see Figure 6 for the distribution of forks and stars). Furthermore, there exists a strong correlation, ranging from 0.7242 to 0.7922, between stars, forks and subscribers, indicating that all metrics represent the popularity similarly and making one of the three metrics somewhat redundant. Figure 7 shows the correlation between Stars/Forks and Stars/Subscribers, while the correlation between Forks/Subscribers is left out since it yielded a similar result of 0.7815.

Another problem with only using one of the metrics is that, for example, the number of stars is not tied to the extent of usage of a project. The OpenSSL project (footnote 11) has nearly 4600 stars on GitHub while Node (footnote 12) has over 34000, even though the former is used to a larger extent (footnotes 13 and 14). The low number of forks observed could be related to usage of other platforms like Mercurial (footnote 15). A project can easily be cloned from GitHub and then be further developed using another platform, which leads to a fork never being recorded.

11 https://github.com/openssl/openssl
12 https://github.com/nodejs/node
13 https://trends.builtwith.com/Server/OpenSSL
14 https://w3techs.com/technologies/details/ws-nodejs/all/all
15 https://www.mercurial-scm.org/


Figure 6: The distribution of forks and stars. Panels: (a) Forks, (b) Stars.

Figure 7: Correlation between different metrics of popularity. Panels: (a) Stars and forks, (b) Stars and subscribers.

5.2.3 Other Metrics of Popularity

There was one other type of metadata that was supposed to be used as a metric of popularity in this study, the number of pull requests, but due to problems with the API we could not extract this data. This metric reflects the developer aspect, which could have led to a difference in the results. However, using pull requests as a metric would have meant further limitations of the dataset, since only a small part of repositories use them, as found by Kalliamvakou et al. [4]. A metric more fit to measure popularity would probably be the number of downloads (or clones) of a project, as this represents the usage of a project in a fairer way than the currently used metrics, since this action does not require users to be logged in. This data is not provided by GitHub and therefore could not be used in this study. It should be stressed, however, that the popularity represented by a metric need not represent the actual popularity and vice versa.


5.2.4 The Dataset

The dataset extracted has a significant impact on the result and there are several aspects of it that need to be considered. Firstly, the initially received data was corrupted by incorrect replies from the GitHub API, as stated in section 3. Fortunately the corrupt commit data was discovered, but this was not the only incorrect data returned by the API. Seven occurrences of corrupt timestamps were also discovered. This is a very small fraction in relation to the total number of commits, but the occurrences indicate that there might be more incorrect data in the dataset which has not been discovered. If such data exists it might have had an impact on the results, but it is impossible to estimate the extent of it.

Another discovery made in the dataset was that there existed projects which had used the Git version control system before GitHub existed. This means that during the time between a project’s creation and the time when it was uploaded to GitHub, it could not receive any metadata connected to GitHub. In other words, the project could not receive stars, for example, during this period, meaning the conditions for gaining popularity (as measured in this study) were different for some of the projects, which needs to be considered.

Another important aspect of the dataset is its size. Over 12000 projects were retrieved from the GitHub API, but after weeding out the ones not fulfilling the prerequisites only 85 remained. This reduction clearly confirms the findings of Kalliamvakou et al. [4], which stated that most projects are irrelevant for research of this kind. The small number of projects has a significant impact on drawing conclusions from the result, since it only constitutes a small subset of all the projects on GitHub. The results can therefore not be seen as representative of projects on GitHub, which significantly limits the possibility of drawing general conclusions. Even if there had been an indication of a strong correlation, the result would not have been entirely trustworthy.

5.3 Future Research

Evaluating the results and the methods used to retrieve them, there are several things that future studies would do well to consider. Regarding the source of data, there are probably advantages to using a platform which offers the number of downloads for a project, something GitHub does not, as mentioned previously. An advantage of GitHub is the vast number of accessible repositories. Using another platform, one might be limited by the number of repositories offered, depending on how extensive the research is.

One alternative source of data instead of the GitHub API is GitHub Archive, since it contains the complete history of all GitHub activity from February 2011 and thus provides information on when certain events of a project took place, such as when it received a star. This alternative is quite demanding in terms of computing power and storage, so one has to account for these when using GitHub Archive as the source of data.

If GitHub is used one has to consider the time and effort it takes to gather and filter the data. A large share of the repositories are probably considered irrelevant by most studies, and weeding these out can be demanding and time consuming work. There are several ways of extracting data from GitHub, but one of the easier is using its API. Using this option may result in receiving incorrect data: the instructions for page navigation supplied by GitHub resulted in a corrupt dataset during this study, which meant additional work to find a working solution and also revising the dataset to find the corrupt parts. Checking the replies from the API against the real repositories will lower the risk of a corrupt dataset, but one still needs to consider the possibility of incorrect replies from the API. The hard limitation of 5000 requests per hour may also prevent a shorter research project from achieving a desired dataset size.
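The effect of the request limit on data collection time can be estimated with a short calculation. The sketch below assumes one metadata request per project and a commit history retrieved in pages of at most 100 commits; both are assumptions for illustration, not a description of the scraper used in this study, and the function names are hypothetical.

```python
from math import ceil

REQUESTS_PER_HOUR = 5000  # limit stated in the text
COMMITS_PER_PAGE = 100    # assumption: page size of the commit listing


def requests_for_project(commit_count: int, metadata_requests: int = 1) -> int:
    """Requests needed for one project: metadata plus paginated commit history."""
    return metadata_requests + ceil(commit_count / COMMITS_PER_PAGE)


def hours_needed(commit_counts: list[int]) -> float:
    """Lower bound on collection time if the limit is the only bottleneck."""
    total = sum(requests_for_project(c) for c in commit_counts)
    return total / REQUESTS_PER_HOUR


# Made-up workload: 12 000 projects with 250 commits each.
print(requests_for_project(250))        # 4
print(hours_needed([250] * 12000))      # 9.6
```

The estimate grows quickly for projects with long histories, since every additional 100 commits costs one more request.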


6 Conclusions

The results show no strong indication of correlation between commit frequency and popularity. In all measurements the metric of popularity takes low values for most of the projects. One could argue that this indicates that most of the projects that received a low popularity score are unpopular, but that is not necessarily the case. Instead, the low values of popularity can be a consequence of the inadequacy of the metric itself to represent popularity; the number of downloads, for example, would be a better solution or complement for measuring popularity. Even though the metrics used in this study may afterwards be deemed unfit, the results still give a weak indication of certain relationships between commit frequency and popularity. Long streaks of weekly commit activity could be correlated with popularity, as our results give a weak indication of such a relationship. Also, long streaks of weekly inactivity in the sense of commits could have a negative relationship with popularity, as one of the measurements indicated a very weak, but still negative, correlation coefficient for the two variables. However, to be able to confirm the possibility of the mentioned relationships, further research using a larger dataset and better metrics would have to be conducted.


References

[1] Karan Aggarwal, Abram Hindle, and Eleni Stroulia. “Co-evolution of Project Documentation and Popularity Within Github”. In: Proceedings of the 11th Working Conference on Mining Software Repositories. MSR 2014. New York, NY, USA: ACM, 2014, pp. 360–363. isbn: 978-1-4503-2863-0. doi: 10.1145/2597073.2597120. url: http://doi.acm.org/10.1145/2597073.2597120 (visited on 05/11/2017).

[2] T. F. Bissyande et al. “Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub”. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). Nov. 2013, pp. 188–197. doi: 10.1109/ISSRE.2013.6698918.

[3] Klint Finley. Github Has Surpassed Sourceforge and Google Code in Popularity. June 2011. url: http://readwrite.com/2011/06/02/github-has-passed-sourceforge/ (visited on 05/11/2017).

[4] Eirini Kalliamvakou et al. “The Promises and Perils of Mining GitHub”. In: Proceedings of the 11th Working Conference on Mining Software Repositories. MSR 2014. New York, NY, USA: ACM, 2014, pp. 92–101. isbn: 978-1-4503-2863-0. doi: 10.1145/2597073.2597074. url: http://doi.acm.org/10.1145/2597073.2597074 (visited on 05/11/2017).

[5] Audris Mockus, Roy T. Fielding, and James D. Herbsleb. “Two Case Studies of Open Source Software Development: Apache and Mozilla”. In: ACM Trans. Softw. Eng. Methodol. 11.3 (July 2002), pp. 309–346. issn: 1049-331X. doi: 10.1145/567793.567795. url: http://doi.acm.org/10.1145/567793.567795 (visited on 05/11/2017).

[6] Pearson Product-Moment Correlation - When you should run this test, the range of values the coefficient can take and how to measure strength of association. url: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php (visited on 05/11/2017).

[7] Pearson correlation coefficient. en. Apr. 2017. url: https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=775345264 (visited on 05/11/2017).

[8] Spearman’s Rank-Order Correlation. url: https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php (visited on 05/23/2017).

[9] Mann–Whitney U test. Feb. 2017. url: https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test (visited on 05/23/2017).

[10] Oskar Jarczyk et al. “GitHub Projects. Quality Analysis of Open-Source Software”. en. In: Social Informatics. Springer, Cham, Nov. 2014, pp. 80–94. doi: 10.1007/978-3-319-13734-6_6. url: https://link.springer.com/chapter/10.1007/978-3-319-13734-6_6 (visited on 05/11/2017).


[11] Y. Weicheng, S. Beijun, and X. Ben. “Mining GitHub: Why Commit Stops Exploring the Relationship between Developer’s Commit Pattern and File Version Evolution”. In: 2013 20th Asia-Pacific Software Engineering Conference (APSEC). Vol. 2. Dec. 2013, pp. 165–169. doi: 10.1109/APSEC.2013.133.

[12] Kevin Peterson. “The GitHub Open Source Development Process”. In: (). url: http://kevinp.me/github-process-research/github-process-research.pdf (visited on 05/11/2017).

[13] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. en. Routledge, May 2013. isbn: 978-1-134-74277-6.
