
The Perils of Doing Software Engineering Research using GitHub Data

Date post: 20-May-2015
Upload: dmgerman
Description:
With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
Transcript
Page 1

What is in GitHub?

Daniel M German

Page 4

Researcher states:

“40% of pull requests are not merged”

● Based on simply querying GHTorrent data
● But it ignores what really happens
● Many pull requests are merged without being marked as merged in GitHub

● GHTorrent data has many potential threats to validity
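
To illustrate why the naive query overstates the share of unmerged pull requests, here is a minimal sketch in Python. It is not the method used in the study: the `PullRequest` and `Commit` records, the `CLOSING_REF` pattern, and both heuristics (a proposed commit SHA showing up in the base branch, or a base-branch commit message that references the pull request with a closing keyword) are assumptions made for illustration only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    number: int
    marked_merged: bool        # the "merged" flag as recorded in the dataset
    head_shas: frozenset[str]  # SHAs of the commits the pull request proposed


@dataclass(frozen=True)
class Commit:
    sha: str
    message: str


# Hypothetical pattern for closing keywords such as "closes #12" or "fixes #3".
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def looks_merged(pr: PullRequest, base_commits: list[Commit]) -> bool:
    """Return True if the flag or either heuristic suggests the PR was merged."""
    if pr.marked_merged:
        return True
    for commit in base_commits:
        # Heuristic 1: a commit proposed by the PR appears in the base branch.
        if commit.sha in pr.head_shas:
            return True
        # Heuristic 2: a base-branch commit message closes this PR by number.
        if str(pr.number) in CLOSING_REF.findall(commit.message):
            return True
    return False


def unmerged_share(prs: list[PullRequest], base_commits: list[Commit],
                   use_heuristics: bool) -> float:
    """Fraction of PRs counted as unmerged, with or without the heuristics."""
    if not prs:
        return 0.0
    if use_heuristics:
        unmerged = [pr for pr in prs if not looks_merged(pr, base_commits)]
    else:
        unmerged = [pr for pr in prs if not pr.marked_merged]
    return len(unmerged) / len(prs)
```

Calling `unmerged_share` twice on the same data, once with `use_heuristics=False` and once with `True`, shows how far the two counts diverge. Commits that were rebased or squashed before landing get new SHAs, so the first heuristic will miss them; a sketch like this only narrows the gap.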

Page 5

What is GitHub used for?

Page 6

"I store my presentations in github. I don't need USB stick anymore!"

Page 9

Are there potential threats to validity for studies that assume GitHub is about software engineering only?

Page 10

Methodology

● Reuse:
– Surveys
– Data analysis for other papers

● Mixed methods:
– Quantitative, and
– Qualitative

Page 13

Uses:

Page 14

Most projects are inactive

Page 15

Social?

67% of projects are personal repos

95% have 3 or fewer committers
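
A figure like the committer share above can be computed from a per-repository list of commit authors. The sketch below is only an illustration: the function name `share_with_few_committers` and the input shape (a dictionary mapping a repository name to the author identifier of each commit) are assumptions, not part of the study, and it treats each distinct identifier as one person.

```python
def share_with_few_committers(committers_by_repo: dict[str, list[str]],
                              limit: int = 3) -> float:
    """Fraction of repositories with at most `limit` distinct committers."""
    if not committers_by_repo:
        return 0.0
    small = sum(
        1
        for authors in committers_by_repo.values()
        # Repeated commits by the same identifier count once.
        if len(set(authors)) <= limit
    )
    return small / len(committers_by_repo)
```

One person committing under several email addresses would be counted several times here, so real data would need identity merging before a number like this could be trusted.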

Page 16

Self contained?

“Any serious project would have to have some separate infrastructure - mailing lists, forums, irc channels and their archives, build farms, etc. [...] Thus while GitHub and all other project hosts are used for collaboration, they are not and can not be a complete solution.”

Page 17

But... what about the users?

Page 18

Switch to http://osrc.dfm.io/dmgerman

Page 19

Is it still worth exploring GitHub?

Definitely!

