
Predicting More from Less: Synergies of Learning

Transcript
Page 1: Predicting More from Less: Synergies of Learning

Predicting More from Less: Synergies of Learning

Ekrem Kocaguneli, [email protected], Bojan Cukic, [email protected], Huihua Lu, [email protected]

RAISE'13 2nd International NSF sponsored Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 5/25/2013


Page 2: Predicting More from Less: Synergies of Learning

Collecting data is important

SourceForge currently hosts 324K projects with a user base of 3.4M [1]

GoogleCode hosts 250K open source projects [2]

1. http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net
2. https://developers.google.com/open-source/

Page 3: Predicting More from Less: Synergies of Learning

Also, there is an abundance of SE data repositories

ISBSG [1], PROMISE [2], Eclipse Bug Data [3], TukuTuku [4]

1. C. Lokan, T. Wright, P. Hill, and M. Stringer. Organizational benchmarking using the ISBSG data repository. IEEE Software, 18(5):26–32, 2001.

2. T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.

3. T. Zimmermann, R. Premraj, and A. Zeller. Predicting defects for Eclipse. In PROMISE'07: International Workshop on Predictor Models in Software Engineering (ICSE Workshops), 2007.

4. http://www.metriq.biz/tukutuku/

Page 4: Predicting More from Less: Synergies of Learning


We have mountains of data, but then what?

Page 5: Predicting More from Less: Synergies of Learning

Abundance of data is promising for predictive modeling and supervised learning

Yet, dependent variable information is not always available!

Dependent variables (labels, effort values, etc.) may be missing, outdated, or available only for a limited number of instances

Page 6: Predicting More from Less: Synergies of Learning

Transfer learning: when an organization has no local data, or the local data is outdated, transferring data helps

Semi-supervised learning: when only a limited amount of data is labeled, we can use the existing labels to label other training instances

Active learning: when no labels exist, we can request labels from experts, at a cost

Page 7: Predicting More from Less: Synergies of Learning

Transfer learning: how to transfer data between domains and projects?

Semi-supervised learning: how to accommodate prediction problems for which only a limited amount of labeled instances is available?

Active learning: how to handle prediction problems in which no instances have labels?

Page 8: Predicting More from Less: Synergies of Learning


What is the current state-of-the-art?

Page 9: Predicting More from Less: Synergies of Learning

Transfer learning - 1

Transfer learning is a set of learning methods that allow the training and test sets to have different domains and/or tasks (Ma2012 [1]).

SE transfer learning studies (a.k.a. cross-company learning) have the same task yet different domains (data coming from different organizations or different time frames).

[1] Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012.

Page 10: Predicting More from Less: Synergies of Learning

Transfer learning - 2

Transfer learning results in SE show instability and significant variability if cross data is used as-is (Kitchenham2007 [1], Zimmermann2009 [2]).

Filtering-based approaches (see the sketch below) support prior results (Turhan2009 [3], Kocaguneli2011 [4]):
• Transferring all cross data yields poor performance
• Filtering cross data significantly improves estimation

[1] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross versus within-company cost estimation studies: A systematic review. IEEE Trans. Softw. Eng., 33(5):316–329, 2007.
[2] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. ESEC/FSE, pages 91–100, 2009.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[4] E. Kocaguneli and T. Menzies. How to find relevant data for effort estimation. In ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
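As an illustration, a minimal sketch of relevancy filtering in the spirit of the NN-filter (the Euclidean distance and k=10 neighbors are assumptions for illustration, not the exact published configuration):

import numpy as np

def nn_filter(cross_X, cross_y, within_X, k=10):
    # Keep only the cross-company rows that appear among the k nearest
    # neighbors of at least one within-company (local) row.
    keep = set()
    for x in within_X:
        d = np.linalg.norm(cross_X - x, axis=1)   # distance to every cross row
        keep.update(np.argsort(d)[:k].tolist())   # indices of the k closest rows
    idx = sorted(keep)
    return cross_X[idx], cross_y[idx]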

Page 11: Predicting More from Less: Synergies of Learning

Semi-supervised learning (SSL) - 1

SSL methods are a group of machine learning algorithms that learn from a set of training instances among which only a small subset has pre-assigned labels [1].

SSL helps relax supervised methods' dependence on dependent variable information.

Hence, SSL can supplement supervised estimation methods.

[1] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised Learning. MIT Press, Cambridge, MA, USA, 2006.

Page 12: Predicting More from Less: Synergies of Learning

Semi-supervised learning (SSL) - 2

Despite the promise, SSL appears to be less than thoroughly investigated in SE.

Lu et al. use an SSL algorithm augmented with multi-dimensional scaling (MDS) as a pre-processor, which outperforms corresponding supervised methods [1].

Li et al. developed a framework which maps ensemble learning and random forests into an SSL setting [2].

[1] Huihua Lu, Bojan Cukic, and Mark Culp. Software defect prediction using semi-supervised learning with dimension reduction. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012).
[2] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19:201–230, 2012.
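To make the mechanism concrete, a minimal self-training sketch (self-training is one common SSL scheme; the random-forest base learner, confidence threshold, and iteration cap are illustrative assumptions, not the setups of the cited papers):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    # Iteratively pseudo-label the unlabeled pool with high-confidence
    # predictions, then retrain on the grown labeled set.
    pool = X_unlab.copy()
    model = RandomForestClassifier(n_estimators=100).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        sure = proba.max(axis=1) >= threshold       # confident rows only
        if not sure.any():
            break
        pseudo = model.classes_[proba[sure].argmax(axis=1)]
        X_lab = np.vstack([X_lab, pool[sure]])
        y_lab = np.concatenate([y_lab, pseudo])     # add the pseudo-labels
        pool = pool[~sure]                          # shrink the unlabeled pool
        model = RandomForestClassifier(n_estimators=100).fit(X_lab, y_lab)
    return model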

Page 13: Predicting More from Less: Synergies of Learning

Active Learning (AL) - 1

AL methods are unsupervised methods working on an initially unlabeled data set.

AL methods can query an oracle, which can provide labels. Yet, each label comes with a cost. Hence, we need as few queries as possible (a minimal query loop is sketched below).

e.g. Balcan et al. show AL provides the same performance as a supervised learner with substantially smaller sample sizes [1].

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 65–72, 2006.
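A minimal sketch of such a query loop, using uncertainty sampling (the oracle callback, logistic-regression learner, and budget are hypothetical names and settings for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, oracle, seed_idx, budget=20):
    # oracle(i) returns the true label of row i, at some labeling cost.
    # seed_idx must cover at least two classes so the learner can fit.
    labeled = list(seed_idx)
    labels = [oracle(i) for i in labeled]
    model = None
    for _ in range(budget):
        model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], labels)
        uncertainty = 1 - model.predict_proba(X_pool).max(axis=1)
        uncertainty[labeled] = -1                 # never re-query labeled rows
        i = int(np.argmax(uncertainty))           # most uncertain instance
        labeled.append(i)
        labels.append(oracle(i))                  # spend one unit of budget
    return model, labeled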

Page 14: Predicting More from Less: Synergies of Learning

Active Learning (AL) - 2

In SE, AL methods hold good potential to reduce labeling costs.

Lu et al. propose an AL-based fault prediction method, which outperforms supervised techniques by using 20% or less of the data [1].

Kocaguneli et al. use AL in software effort estimation. The proposed method performs comparably to supervised methods with 31% of the original data [2].

[1] Huihua Lu and Bojan Cukic. An adaptive approach with active learning in software fault prediction. In Proceedings of the 8th International Conference on Predictive Models in Software Engineering (PROMISE '12), 2012.
[2] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Transactions on Software Engineering, vol. PP, no. 99 (early access).

Page 15: Predicting More from Less: Synergies of Learning


So what do we do?

Page 16: Predicting More from Less: Synergies of Learning

Strengths and Weaknesses

Supervised Learning (SL)
Strengths:
• Successfully used in SE for predictive purposes.
• Provides successful estimation performance.
Challenges:
• Requires retrospective local data.
• Requires dependent variable information.

Transfer Learning (TL)
Strengths:
• Enables data to be transferred between different organizations or time frames.
• Provides a solution to the lack of local data.
• After relevancy filtering, cross data can perform as well as within data.
Challenges:
• Using cross data in an as-is manner results in unstable performance.
• TL filters relevant cross data, which reduces the transferred cross data amount.

Semi-supervised Learning (SSL)
Strengths:
• Enables learning from small sets of labeled instances.
• Supplements the learning with unlabeled instances.
• Relaxes the requirement of dependent variables.
Challenges:
• Although small, it still requires an initially labeled set of training instances.
• For datasets with a large number of independent features, it requires feature subset selection.

Active Learning (AL)
Strengths:
• Helps find the essential content of the data.
• Decreases the amount of dependent variable information required, thereby reducing the associated data collection costs.
Challenges:
• Susceptible to unbalanced class distributions in classification problems.

Page 17: Predicting More from Less: Synergies of Learning

Strengths and Weaknesses (recap)

• SL: requires retrospective local data.
• TL: provides a solution to the lack of local data; filtering relevant cross data reduces the transferred cross data amount.
• SSL: enables learning from small sets of labeled instances.
• AL: helps find the essential content of the data.

These complementary points lead to three synergies (#1, #2, #3), discussed next.

Page 18: Predicting More from Less: Synergies of Learning

Synergy #1

Synergy #1 is already being pursued in SE

With successful applications of transferring data across:
• Domains
• Time frames

Page 19: Predicting More from Less: Synergies of Learning

Synergy #2

Filtering labeled cross data yields a very limited amount of locally relevant data

SSL can use the filtered cross data to provide pseudo-labels for the unlabeled within data, as the sketch below illustrates
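A minimal sketch of how this synergy might compose, reusing the hypothetical nn_filter and self_train helpers from the earlier sketches (the composition is an illustrative assumption, not the authors' pipeline):

def synergy2(X_cross, y_cross, X_within):
    # 1. Keep only the cross-company rows relevant to the local projects.
    Xf, yf = nn_filter(X_cross, y_cross, X_within, k=10)
    # 2. Use them as the labeled seed; self-training pseudo-labels the
    #    unlabeled within-company data and returns the final model.
    model = self_train(Xf, yf, X_within, threshold=0.9)
    # 3. Predict labels for the local projects.
    return model.predict(X_within)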

Page 20: Predicting More from Less: Synergies of Learning

Synergy #3

SE data (defect and effort) can be summarized with its essential content

Transfer learning may benefit from using the essential content instead of all the data, which may contain noise and outliers (see the sketch below)
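One simple way to picture "essential content" is prototype selection; a minimal sketch, assuming a k-means-based summary (the clustering method and prototype count are illustrative assumptions, not the technique evaluated on the next slides):

import numpy as np
from sklearn.cluster import KMeans

def essential_content(X, y, n_prototypes=30):
    # Summarize the data by the row nearest each k-means centroid,
    # discarding the rest as potential noise and outliers.
    km = KMeans(n_clusters=n_prototypes, n_init=10).fit(X)
    idx = []
    for c in range(n_prototypes):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue                               # empty cluster, skip
        d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        idx.append(members[np.argmin(d)])          # medoid-like prototype
    idx = np.array(idx)
    return X[idx], y[idx]                          # the "essential" subset

Transfer learning could then filter or train on essential_content(X_cross, y_cross) instead of the full cross data.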

Page 21: Predicting More from Less: Synergies of Learning


Did you try any of the synergies?

Page 22: Predicting More from Less: Synergies of Learning


Experiments with Synergy #3

Page 23: Predicting More from Less: Synergies of Learning

Experiments with Synergy #3

• Estimation from pseudo-labeled within data
• Within data is summarized to at most 15% of its original size
• Opportunity for the within data to be locally interpreted

Page 24: Predicting More from Less: Synergies of Learning


What have we covered?

