A data driven journey through research on software engineering

Post on 19-Jun-2015


A DATA-DRIVEN JOURNEY THROUGH RESEARCH ON SOFTWARE ENGINEERING

Mario Sangiorgio

MOTIVATION

Getting a better idea of what’s going on in the software engineering research community through a quantitative approach

RELATED WORKS

• C. Ghezzi - Keynote at ICSE 2008
Reflections on 40+ years of software engineering research and beyond

• L. Briand - Keynote at ICSM 2011
Useful software engineering research: leading a double agent life

• D. Rosenblum - Keynote at ASE 2012
Whither software engineering research?

SUBJECTS OF OUR STUDY

researchers

affiliations

geographical areas

research topics

DATA

ACADEMIC LITERATURE

SELECTED PUBLICATIONS

REPRESENTATIVENESS

AUTHORITATIVENESS

DATA SOURCES

Articles published and their authors

Citations, authors and affiliation details

COMPLETE XML DATABASE

APIs

COLLECTED DATA

Venue       Number of papers   From   To
TSE         3043               1975   2012
TOSEM       295                1992   2012
ICSE        2907               1976   2012
ASE         1116               1997   2012
ESEC/FSE    416                1987   2012
TOTAL       7777               1975   2012

9865 researchers

278794 citations

ANALYSIS

AUTHOR ANALYSIS

Who published the most?

Are there sub-communities?

MOST PROLIFIC AUTHORS

Software Engineering   ICSE           ASE          ESEC/FSE       TSE          TOSEM
Basili 60              Boehm 28       Xie 24       Clarke 8       Basili 33    Notkin 13
Notkin 56              Basili 26      Grundy 18    D. Jackson 8   Briand 26    Rothermel 8
Kramer 49              Osterweil 23   Hosking 16   Ernst 7        Weyuker 18   Roman 6
Harrold 46             Kramer 21      Egyed 16     Notkin 7       Knight 17    Wolf 6
Xie 46                 Notkin 21      Lo 16        Uchitel 7      Kramer 16    Harrold 6

SUB-COMMUNITY DETECTION

For each venue we consider the most prolific authors

We compute the set similarity between all pairs of venues

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
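The similarity above is the Jaccard index of two sets. A minimal sketch of the pairwise comparison follows. It assumes each venue is represented by the five names listed on the "Most prolific authors" slide; the cutoff actually used in the study is not stated, and `jaccard` and `pairwise_similarity` are illustrative names.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def pairwise_similarity(top_authors):
    """Map each unordered pair of venues to the similarity of their author sets."""
    venues = sorted(top_authors)
    return {
        (v, w): jaccard(top_authors[v], top_authors[w])
        for i, v in enumerate(venues)
        for w in venues[i + 1:]
    }


# Assumption: top-5 names per venue, taken from the slide above.
top_authors = {
    "ESEC/FSE": {"Clarke", "D. Jackson", "Ernst", "Notkin", "Uchitel"},
    "TSE": {"Basili", "Briand", "Weyuker", "Knight", "Kramer"},
    "TOSEM": {"Notkin", "Rothermel", "Roman", "Wolf", "Harrold"},
}
similarity = pairwise_similarity(top_authors)
```

Turning each similarity into a distance, for instance `1 - J(A, B)`, would give a matrix that a multidimensional scaling routine can place on a plane. Whether the study did exactly this is an assumption.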

SUB-COMMUNITIES

[Figure: scatter plot with axes mds[,1] and mds[,2] showing the positions of TSE, TOSEM, ICSE, ASE and FSE]

TOPIC ANALYSIS

What is the topic of a paper?

What are the hot topics in software engineering?

How have they evolved?

CITATION NETWORK

Papers in the dataset


Internal citations


Complete citations

Citations from specific venues

EXAMPLE

What is the topic of the yellow paper?

Topic      Direct citations
Topic A    2
Topic B    0
General    1

What is the topic of the general paper?

Topic      Direct citations
Topic A    2
Topic B    1
General    1

Topic profile

Topic A    66%
Topic B    33%
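The example can be reproduced with a short sketch. It rests on an assumption the slides only hint at: a cited paper labelled "General" is replaced by its own topic profile, computed the same way from the papers it cites. The function name, the paper identifiers and the tiny citation graph are hypothetical.

```python
from collections import Counter


def topic_profile(paper, citations, topics, _seen=None):
    """Return {topic: share} for `paper` from the topics of the papers it cites.

    Assumption: a cited paper labelled "General" contributes its own
    profile (computed recursively) instead of the label "General".
    """
    seen = (_seen or frozenset()) | {paper}
    weights = Counter()
    for cited in citations.get(paper, []):
        label = topics.get(cited, "General")
        if label != "General":
            weights[label] += 1.0
        elif cited not in seen:  # guard against citation cycles
            for topic, share in topic_profile(cited, citations, topics, seen).items():
                weights[topic] += share
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else {}


# Hypothetical graph matching the slide: two Topic A citations and one
# general paper that itself cites a Topic B paper.
citations = {"yellow": ["a1", "a2", "g"], "g": ["b1"]}
topics = {"a1": "Topic A", "a2": "Topic A", "b1": "Topic B", "g": "General"}
profile = topic_profile("yellow", citations, topics)
```

Under these assumptions the profile comes out as two thirds Topic A and one third Topic B, in line with the 66% and 33% shown on the slide.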

SOFTWARE ENGINEERING TOPICS

Topic                     Fraction of papers
Programming Languages     9.34%
Formal Methods            8.49%
Software Reliability      6.13%
Distributed Systems       5.96%
Software Maintenance      5.92%
Testing                   4.64%
Software Quality          4.53%
Models                    4.36%
Software Architectures    4.36%

TOPICS IN THE ’70S

Topic                     Fraction of papers
Programming Languages     16.71%
Performance               7.95%
Operating Systems         7.29%
Database Systems          6.84%
Formal Methods            6.65%
Software Architectures    6.14%
Knowledge Engineering     5.69%
Distributed Systems       4.94%
Software Maintenance      4.18%

By far the most represented

Topics from other fields

TOPICS IN THE ’80S

Topic                     Fraction of papers
Programming Languages     10.48%
Distributed Systems       9.30%
Knowledge Engineering     8.47%
Software Reliability      6.68%
Formal Methods            6.51%
Information Systems       5.55%
Software Maintenance      5.04%
Models                    4.35%
Artificial Intelligence   3.74%

Significant rise

Other fields, related to distributed systems

Not only code

TOPICS IN THE ’90S

Topic                     Fraction of papers
Formal Methods            8.29%
Programming Languages     8.13%
Distributed Systems       6.80%
Software Maintenance      6.55%
Software Architectures    5.34%
Software Quality          4.80%
Knowledge Engineering     4.67%
Models                    4.65%
Information Systems       4.40%

Change of the most published topic

Focus on software quality

TOPICS IN THE 2000S

Topic                     Fraction of papers
Formal Methods            9.93%
Programming Languages     8.37%
Testing                   6.86%
Software Maintenance      6.58%
Software Reliability      6.22%
Software Quality          5.72%
Models                    4.80%
Empirical Studies         4.76%
Software Architectures    4.38%

Analysis of open source repositories

Still a lot of emphasis on software quality

NEED FOR A FINER ANALYSIS

SOLUTION: sliding window instead of fixed subdivision

Topics change constantly, not once in a decade
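A sliding window can be sketched as follows. This is a simplified illustration, not the study's implementation: each paper is assumed to carry a single topic label rather than a topic profile, and the window length, the function name and the sample data are all hypothetical.

```python
def sliding_window_share(papers, topic, window=5):
    """Share of papers on `topic` within each run of `window` consecutive years.

    `papers` is an iterable of (year, topic) pairs; the result maps the
    last year of every window to the topic's share in that window.
    """
    papers = list(papers)
    if not papers:
        return {}
    first = min(year for year, _ in papers)
    last = max(year for year, _ in papers)
    shares = {}
    for end in range(first + window - 1, last + 1):
        # Topics of all papers published in the years end-window+1 .. end.
        in_window = [t for year, t in papers if end - window < year <= end]
        if in_window:
            shares[end] = in_window.count(topic) / len(in_window)
    return shares


# Hypothetical sample data, for illustration only.
sample = [
    (2000, "Testing"), (2000, "Models"), (2001, "Models"),
    (2002, "Testing"), (2003, "Testing"), (2003, "Testing"),
]
testing_trend = sliding_window_share(sample, "Testing", window=3)
```

Because consecutive windows overlap, the resulting series changes gradually from year to year instead of jumping at decade boundaries.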

[Figures: one trend plot per topic, x-axis 1975 to 2005, y-axis 0 to 0.18]

TESTING

EMPIRICAL STUDIES

SERVICES

DISTRIBUTED SYSTEMS

PROGRAMMING LANGUAGES

PER-VENUE INSIGHTS

Venue       Peculiarities
TSE         Biased towards empirical works
TOSEM       More focused on formal aspects
ICSE        Balanced with respect to other venues
ESEC/FSE    Formal, with interests in testing, modeling and requirements engineering
ASE         Interests in program analysis and automated reasoning

AFFILIATION ANALYSIS

Where do the most prolific authors work?

How much research is done in industry?

AFFILIATION PROFILE

Author      Affiliation
Author A    1
Author B    2
Author B    2

Affiliation profile

Affiliation 1    33%
Affiliation 2    66%
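The profile above amounts to giving every author of a paper an equal vote for their affiliation. A minimal sketch under that assumption follows; the function names are illustrative. Summing the profiles over many papers is one way fractional paper counts per affiliation could arise, although the slides do not spell out the aggregation.

```python
from collections import Counter


def affiliation_profile(author_affiliations):
    """Share of a paper credited to each affiliation, one equal vote per author."""
    counts = Counter(author_affiliations)
    total = sum(counts.values())
    return {aff: n / total for aff, n in counts.items()}


def papers_per_affiliation(papers):
    """Sum the profiles of many papers into fractional paper counts (assumed aggregation)."""
    totals = Counter()
    for author_affiliations in papers:
        for aff, share in affiliation_profile(author_affiliations).items():
            totals[aff] += share
    return dict(totals)


# The three authors from the slide: one at Affiliation 1, two at Affiliation 2.
profile = affiliation_profile(["Affiliation 1", "Affiliation 2", "Affiliation 2"])

# Two hypothetical papers aggregated into fractional counts.
totals = papers_per_affiliation([
    ["Affiliation 1", "Affiliation 2", "Affiliation 2"],
    ["Affiliation 1"],
])
```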

MOST PROLIFIC AFFILIATIONS

Affiliation                              Papers
IBM                                      186.32
Carnegie Mellon University               166.52
University of Texas, Austin              122.62
University of Maryland                   106.83
Microsoft                                101.63
AT&T Bell Laboratories                   101.37
University of California, Irvine         98.17
Georgia Institute of Technology          94.75
Massachusetts Institute of Technology    93.24
University of Virginia                   81.55

ALL FROM THE USA

PER-VENUE INSIGHTS

Venue       Peculiarities
TSE         The venue with the most industrial contribution
TOSEM       European universities among the top contributors
ICSE        Balanced mix of the contributors seen in the other venues
ESEC/FSE    Despite ESEC, there is no bias towards Europe
ASE         Industrial contribution is less relevant. Some affiliations appear only in its top list.

Is Europe more formal?

Is it linked to the presence of empirical works?

It is representative

INDUSTRY VS ACADEMIA

[Figure: Industry and Academia over time, x-axis 1970 to 2005, y-axis 0 to 1.00]

GEOGRAPHICAL ANALYSIS

Where does the contribution come from?

GEOGRAPHICAL AREAS

North America

Europe

Asia & Oceania

Africa

South America

LOCATION OF A PAPER

Affiliation profile

Affiliation 1    20%
Affiliation 2    30%
Affiliation 3    50%

Locations

Affiliation 1    North America
Affiliation 2    Europe
Affiliation 3    Europe

Location profile

North America    20%
Europe           80%
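The step from the affiliation profile to the location profile is a sum of shares per geographical area. A minimal sketch using the numbers from the slide follows; the function name is illustrative.

```python
def location_profile(affiliation_shares, locations):
    """Add up affiliation shares by the geographical area of each affiliation."""
    profile = {}
    for affiliation, share in affiliation_shares.items():
        area = locations[affiliation]
        profile[area] = profile.get(area, 0.0) + share
    return profile


affiliation_shares = {"Affiliation 1": 0.20, "Affiliation 2": 0.30, "Affiliation 3": 0.50}
locations = {
    "Affiliation 1": "North America",
    "Affiliation 2": "Europe",
    "Affiliation 3": "Europe",
}
areas = location_profile(affiliation_shares, locations)
```

Here the two European affiliations contribute 30% and 50%, which add up to the 80% shown for Europe.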

GEOGRAPHICAL DISTRIBUTION

[Figure: contribution by geographical area over time, x-axis 1970 to 2005, y-axis 0 to 1.00; series: Europe, North America, South America, Asia & Oceania, Africa]

CONCLUSION

Academic literature contains a lot of information about a scientific community

With data mining techniques we can unveil it and get some interesting insights

QUESTIONS?