Is the Pareto Principle Applicable to the Core Teams
of GitHub Projects?
Kazuhiro Yamashita
Yasutaka Kamei
Shane McIntosh
Naoyasu Ubayashi
Ahmed E. Hassan
Core developers play a critical role
in software development
Core developers are responsible for guiding and coordinating the development of an OSS project.
Core developers are the most productive developers, who have made roughly 80% of the total contributions.
Nakakoji
Mockus
In fact, some argue that core developers in OSS projects follow the Pareto Principle
[Figure: Effort vs. Result under the Pareto Principle: 20% of the effort yields 80% of the result]
Pareto Principle in Software Development
[Figure: in a project, 20% of the developers produce 80% of the artifacts]
Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: prior studies grouped as Pareto, Non-Pareto, or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE)]
These results depend on a small number of case study systems.
[Figure: the same prior studies, grouped by the core team size they report: "< 10 or 15" or "Other"]
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle
Number of Core Developers
Core and Non-Core Developers Activities
Collecting and analyzing GitHub data to study core team activity
[Figure: study pipeline. Projects -> Filter -> Heuristics (developers split into Core and Non-Core) -> Calc Prop (Core Team Size) and Classify Commits (Activity)]
Preprocessing GitHub data to handle forks, duplicates, and to remove immature projects
8,510,504 repositories -> 2,496 repositories
Using heuristics to identify core team members
Three heuristics, each identifying its own set of core developers: Commit-based, LOC-based, Access-based
Our commit-based core contributor heuristic
[Figure: contributors A, B, C, and D, each shown with their number of commits (one square = one commit)]
Step 1: Sort contributors by their number of commits (here: A, B, C, D).
Step 2: Compute the proportion of commits that each contributor made.
Commits ratio: A 60%, B 20%, C 10%, D 10%
Step 3: Core contributors are those developers who fall within the 0.8 cumulative contribution cutoff.
Cumulative ratio: A 0.6, B 0.8, C 0.9, D 1.0
Num CoreDev = 2 (A and B)
Pct. CoreDev = 2/4*100 = 50%
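The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `core_contributors` is hypothetical, ties are broken by contributor name as an assumption, and the boundary handling (collect top contributors until their commits reach the cutoff) is one reading of "the 0.8 cumulative contribution cutoff".

```python
from fractions import Fraction


def core_contributors(commit_counts, cutoff=Fraction(4, 5)):
    """Return the top contributors whose commits reach the cumulative cutoff."""
    total = sum(commit_counts.values())
    # Step 1: sort by number of commits, descending.
    # Breaking ties by name is an assumption made for determinism.
    ranked = sorted(commit_counts.items(), key=lambda item: (-item[1], item[0]))
    core, cumulative = [], 0
    for name, commits in ranked:
        # Steps 2 and 3: stop once the contributors collected so far
        # already account for the cutoff share of all commits.
        if cumulative >= cutoff * total:
            break
        core.append(name)
        cumulative += commits
    return core


# The four-contributor example from the slides (60%, 20%, 10%, 10%).
counts = {"A": 60, "B": 20, "C": 10, "D": 10}
core = core_contributors(counts)
num_core_dev = len(core)
pct_core_dev = num_core_dev / len(counts) * 100
print(core, num_core_dev, pct_core_dev)
```

Using `Fraction` for the cutoff keeps the comparison exact, so a contributor who lands precisely on 0.8 (B in the example) is handled without floating-point surprises.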
Our approach to studying Core Team Size
Compliance with the Pareto Principle
[Figure: Percentage of Core Devs, with axis marks at 10%, 20%, and 30%]
Stratify projects along the confounding factors (LOC, Total Author, Age), each split into Small, Medium, and Large
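As a rough illustration of the stratification step, the sketch below splits projects into Small, Medium, and Large groups along one factor and summarizes the percentage of core developers per group. The slides do not say how the group boundaries are chosen, so the equal-thirds split is an assumption, and the function name, field names, and project values are all hypothetical.

```python
from statistics import median

LABELS = ("Small", "Medium", "Large")


def stratify(projects, factor):
    """Split projects into three equally sized groups by `factor`.

    Equal thirds are an assumption; the slides only name the groups.
    """
    ranked = sorted(projects, key=lambda p: p[factor])
    n = len(ranked)
    bounds = (0, n // 3, 2 * n // 3, n)
    return {label: ranked[bounds[i]:bounds[i + 1]] for i, label in enumerate(LABELS)}


# Hypothetical projects; "pct_core" is the percentage of core developers.
projects = [
    {"name": "p1", "loc": 1200, "pct_core": 50.0},
    {"name": "p2", "loc": 3000, "pct_core": 40.0},
    {"name": "p3", "loc": 15000, "pct_core": 30.0},
    {"name": "p4", "loc": 40000, "pct_core": 20.0},
    {"name": "p5", "loc": 250000, "pct_core": 10.0},
    {"name": "p6", "loc": 900000, "pct_core": 5.0},
]
strata = stratify(projects, "loc")
summary = {label: median(p["pct_core"] for p in group) for label, group in strata.items()}
print(summary)
```

The same function can be called with a different `factor` (for example a total-author or age field) to repeat the comparison along the other confounding factors.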
The example project does not follow the Pareto Principle
Core team proportions are widely spread
[Figure: distribution of core team proportions under the commit-based heuristic, with projects divided by LOC]
Often, there are 15 or fewer core developers in a project
Number of core developers in projects
Commit-Based: 88%, LOC-Based: 98%, Access-Based: 96%
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: More than half of the projects do not follow the Pareto Principle
Number of Core Developers: Most projects have 15 or fewer core developers
Core and Non-Core Developers Activities
Our approach to studying activity
We classify the commits by using keywords.
Category      Activity Type            Keywords
Development   Forward Engineering      implement, add, request
Maintenance   Reengineering            optimiz, adjust
              Corrective Engineering   bug, fix, issue, error
              Management               license, formatting, TODO
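A keyword-based classifier along these lines might look as follows. This is a minimal sketch under stated assumptions: the keyword lists come from the table, but the case-insensitive substring matching, the order in which activity types are checked, and the "Unclassified" fallback are assumptions, and the function names are hypothetical.

```python
from collections import Counter

# Keywords from the table; "optimiz" is a stem that covers optimize/optimization.
KEYWORDS = {
    "Forward Engineering": ["implement", "add", "request"],
    "Reengineering": ["optimiz", "adjust"],
    "Corrective Engineering": ["bug", "fix", "issue", "error"],
    "Management": ["license", "formatting", "todo"],
}


def classify_commit(message):
    """Return the first activity type whose keyword occurs in the message."""
    text = message.lower()
    for activity, keywords in KEYWORDS.items():
        # Plain substring matching is a simplification: "add" also matches "address".
        if any(keyword in text for keyword in keywords):
            return activity
    return "Unclassified"


def activity_proportions(messages):
    """Share of commits per activity type for one group of developers."""
    counts = Counter(classify_commit(m) for m in messages)
    total = sum(counts.values())
    return {activity: n / total for activity, n in counts.items()}


messages = [
    "Implement OAuth login",
    "Fix crash on startup",
    "Optimize query planner",
    "Update LICENSE year",
]
print(activity_proportions(messages))
```

Running `activity_proportions` once on the commits of core developers and once on those of non-core developers gives the two distributions that the study compares.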
No big differences in proportions of development activities
[Figure: proportions of development activities under the Commit-Based, LOC-Based, and Access-Based heuristics]
Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: More than half of the projects do not follow the Pareto Principle
Number of Core Developers: Most projects have 15 or fewer core developers
Core and Non-Core Developers Activities: There are no big differences between core and non-core activities
Extremely large core teams may be interesting
Number of projects by number of core developers:
Heuristic      -15     16-20   21-50   51-100   101-
Commit-Based   2,197   98      137     17       47
LOC-Based      2,454   15      13      4        10
Access-Based   1,164   24      24      0        0
Many projects face a bus factor risk
Commit-Based: 43% (Core=1: 8%)
LOC-Based: 81% (Core=1: 24%)
Access-Based: 54% (Core=1: 21%)
In fact, most projects have fewer than 5 core developers
Conclusion
Core Developer
• additional slides
Additional description of our definition
[Figure: cumulative ratio for contributors A, B, C, D, and E, with marks at 0.8 and 1.0, annotated "Depend on Name"]
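Commit-based
[Figure: results stratified by Age and Total Author]
LOC-based
[Figure: results stratified by Age, Total Author, and LOC]
Access-based
[Figure: results stratified by Age, Total Author, and LOC]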
Data Extraction
(1) Filter projects by GHTorrent
Filter out forked repositories. Forking is one of the features of GitHub: an original repository is forked (cloned) into a fork repository, which can send pull requests back to the original.
Filter out repositories with fewer than 10 developers.
Filter out repositories that are developed outside of GitHub.
8,510,504 repositories -> 4,618 repositories
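The filtering criteria of step (1) can be sketched as a simple predicate over repository metadata. The `Repo` record and its field names are hypothetical and do not reflect the actual GHTorrent schema; how a repository is recognized as "developed outside of GitHub" is not described in the slides, so it is represented here by a plain boolean flag.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical metadata record, not the GHTorrent schema.
    name: str
    is_fork: bool
    num_developers: int
    developed_on_github: bool


def passes_metadata_filter(repo, min_developers=10):
    """Keep original repositories with enough developers that live on GitHub."""
    return (
        not repo.is_fork
        and repo.num_developers >= min_developers
        and repo.developed_on_github
    )


repos = [
    Repo("original", is_fork=False, num_developers=25, developed_on_github=True),
    Repo("fork-of-original", is_fork=True, num_developers=25, developed_on_github=True),
    Repo("tiny", is_fork=False, num_developers=3, developed_on_github=True),
    Repo("mirror", is_fork=False, num_developers=40, developed_on_github=False),
]
kept = [repo.name for repo in repos if passes_metadata_filter(repo)]
print(kept)
```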
(2) Clone repositories to a local server
4,618 repositories -> 4,154 repositories
(3) Filter duplicate projects
[Figure: Project A has a fork (Fork of A); Project B is a clone of A that was registered as a separate project]
4,618 repositories -> 3,533 repositories
Compare SHAs of Project A and Project B:
c87cce1e1a7260f40ccb5455e44c8b67f28651fa5e
655b8be757dd93a4cf3718145880cf484e34e63bde
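One way to detect such duplicates is to compare the sets of commit SHAs of the cloned repositories, as sketched below. The slides only say "Compare SHAs", so the exact rule is an assumption: here two repositories count as duplicates if they share at least one commit SHA, and the repository seen first is the one that is kept. The function name and the short SHA strings are hypothetical.

```python
def drop_duplicates(repo_shas):
    """Keep a repository only if it shares no commit SHA with one already kept.

    `repo_shas` maps a repository name to the set of its commit SHAs.
    """
    seen = set()
    kept = []
    for name, shas in repo_shas.items():
        if not seen.isdisjoint(shas):
            # Shares history with an earlier repository: treat as a duplicate.
            continue
        kept.append(name)
        seen |= shas
    return kept


# Hypothetical, shortened SHAs for illustration only.
repo_shas = {
    "project-a": {"a1", "a2", "a3"},
    "project-b": {"a1", "a2", "a3", "b4"},  # clone of A registered separately
    "project-c": {"c1", "c2"},
}
print(drop_duplicates(repo_shas))
```

Because Python dictionaries preserve insertion order, the caller controls which copy survives by choosing the order in which repositories are listed.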
(4) Calculate metrics for each repository
LOC, Total Commits, Total Authors, Age
(5) Filter projects by metrics
4,618 repositories -> 2,496 repositories
Filter out repositories with fewer than 10 developers.
Filter out repositories with fewer than 1,000 LOC.
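Steps (4) and (5) can be illustrated together: compute the metrics from a repository's commit list, then apply the two thresholds. This is a sketch under assumptions. The function names are hypothetical, the commit list is a simplified list of (author, timestamp) pairs, LOC is passed in rather than measured, and defining age as the time between the first and last commit is an assumption, since the slides do not define it.

```python
from datetime import datetime


def repository_metrics(commits, loc):
    """Compute Total Commits, Total Authors, and Age from (author, timestamp) pairs."""
    authors = {author for author, _ in commits}
    timestamps = [when for _, when in commits]
    return {
        "loc": loc,
        "total_commits": len(commits),
        "total_authors": len(authors),
        # Assumption: age is the span between the first and last commit.
        "age_days": (max(timestamps) - min(timestamps)).days,
    }


def passes_metric_filter(metrics, min_authors=10, min_loc=1000):
    """Apply the thresholds from step (5): at least 10 developers and 1,000 LOC."""
    return metrics["total_authors"] >= min_authors and metrics["loc"] >= min_loc


# Hypothetical history: 24 commits by 12 developers over 24 consecutive days.
commits = [(f"dev{i % 12}", datetime(2014, 1, 1 + i)) for i in range(24)]
metrics = repository_metrics(commits, loc=5400)
print(metrics, passes_metric_filter(metrics))
```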