
Naive application of Machine Learning to Software Development

Date post: 27-Jan-2015
Upload: andriy-khavryuchenko
Description:
A naive application of machine learning to software development: fetch tickets from the Django Trac ticket tracking system and try to predict how long each ticket will take to close. The outcome suggests that developers aren't putting the RIGHT information into their tracking systems :)
Transcript
Page 1: Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development

Page 2: Naive application of Machine Learning to Software Development

Naive application of Machine Learning to Software Development

or... what developers don't tell :)

Page 5: Naive application of Machine Learning to Software Development

What and why

42 Coffee Cups:

completely distributed development team

Hard facts about how software is done

LOTS OF THEM

Page 8: Naive application of Machine Learning to Software Development

What and why

Facts Profit???

Page 10: Naive application of Machine Learning to Software Development

What and why

???

Toy problem:

get a ticket and predict how long it will take to close it

Bonus: learn scikit-learn :)

Page 15: Naive application of Machine Learning to Software Development

Install scikit-learn

● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++

● pip install -U scikit-learn

Page 16: Naive application of Machine Learning to Software Development

Data: closed tickets

import urllib2

# Trac query that exports the chosen ticket columns as CSV
url = ('https://code.djangoproject.com/query?format=csv'
       '&col=id&col=time&col=changetime&col=reporter'
       '&col=summary&col=status&col=owner&col=type'
       '&col=component&order=priority')

tickets = urllib2.urlopen(url).read()

open('2012-10-09.csv', 'w').write(tickets)
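The same query URL can be assembled from a list of column names instead of being concatenated by hand. A minimal sketch in Python 3, where `urllib2` became `urllib.request`; the helper name `build_query_url` is hypothetical, and the download call is left commented out so the snippet runs offline.

```python
from urllib.parse import urlencode

BASE_URL = 'https://code.djangoproject.com/query'
COLUMNS = ['id', 'time', 'changetime', 'reporter', 'summary',
           'status', 'owner', 'type', 'component']


def build_query_url(columns=COLUMNS, order='priority'):
    """Build the Trac CSV export URL for the given columns (hypothetical helper)."""
    params = [('format', 'csv')]
    params += [('col', name) for name in columns]
    params.append(('order', order))
    return BASE_URL + '?' + urlencode(params)


url = build_query_url()

# To download (needs network access):
# from urllib.request import urlopen
# tickets = urlopen(url).read()
```

Keeping the columns in a list makes it harder to drop a `+` or a quote between fragments, which is easy to do with manual concatenation.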

Page 17: Naive application of Machine Learning to Software Development

Data: closed tickets

id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
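Rows in this shape can be read by column name rather than by position. A short Python 3 sketch that parses the two sample rows from the slide with the standard `csv` module; reading from an in-memory string is just a convenience for the example.

```python
import csv
import io

SAMPLE = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""

# DictReader consumes the header row and keys each record by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row['id'], row['type'], row['component'])
```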

Page 18: Naive application of Machine Learning to Software Development

Data: closed date and description

import urllib2
from bs4 import BeautifulSoup

def get_data(ticket):
    # fetch and parse the ticket's HTML page
    url = 'https://code.djangoproject.com/ticket/%s' % ticket
    ticket_html = urllib2.urlopen(url)
    bs = BeautifulSoup(ticket_html)

Page 19: Naive application of Machine Learning to Software Development

Data: closed date and description

    # get closing date
    d = bs.find_all('div', 'date')[0]
    p = list(d.children)[3]
    href = p.find('a')['href']
    close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
    # drop the trailing UTC offset before parsing
    close_time = datetime.datetime.strptime(
        close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
    # ... more black magic, see code
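The slide's lookup key `'/timeline?from'` is needed because `parse_qs` is given the whole link rather than only its query string. A small Python 3 sketch of the same extraction that splits off the query first; the `href` value is a made-up example shaped like a Trac timeline link, not data from the talk.

```python
import datetime
from urllib.parse import parse_qs, urlsplit

# Hypothetical href, shaped like the timeline link the slide parses.
href = '/timeline?from=2012-05-20T08%3A12%3A37-07%3A00&precision=second'

# Isolate the query string so the parameter is keyed simply as 'from'.
query = parse_qs(urlsplit(href).query)
close_time_str = query['from'][0]

# Drop the trailing UTC offset ('-07:00'), as the slide does with [:-6].
close_time = datetime.datetime.strptime(
    close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
```

Note that slicing off the offset discards the time zone, so the result is a naive local timestamp.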

Page 20: Naive application of Machine Learning to Software Development

Data: closed date and description

def get_data(ticket):
    [...]
    # get description and return
    de = bs.find_all('div', 'description')[0]
    return close_time, de.text

Page 21: Naive application of Machine Learning to Software Development

Data: closed date and description

import csv

tickets_file = csv.reader(open('2012-10-09.csv'))
output = csv.writer(open('2012-10-09.close.csv', 'w'))

next(tickets_file)  # skip the CSV header row

for id, time, changetime, reporter, summary, \
        status, owner, type, component in tickets_file:
    # scrape the closing date and description for this ticket
    closetime, descr = get_data(id)
    row = [id, time, changetime, closetime, reporter,
           summary, status, owner, type, component,
           descr.encode('utf-8')]
    output.writerow(row)

Page 22: Naive application of Machine Learning to Software Development

Scoring: Train/Test set split

cross_validation.train_test_split

(tickets_train, tickets_test, times_train, times_test) = \
    cross_validation.train_test_split(
        tickets, times,
        test_size=0.2,
        random_state=0)
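In current scikit-learn releases this function lives in `sklearn.model_selection` rather than `sklearn.cross_validation`. What the split does can be sketched with the standard library alone; `holdout_split` below is a hypothetical helper, not the library's implementation, and its shuffle will not reproduce scikit-learn's exact partition.

```python
import random


def holdout_split(features, targets, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a test_size fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


tickets = [[float(n)] for n in range(1, 11)]   # toy feature rows
times = [n * 86400.0 for n in range(1, 11)]    # toy targets, in seconds

tickets_train, tickets_test, times_train, times_test = holdout_split(
    tickets, times)
```

Features and targets are selected with the same indices, so each ticket stays paired with its own closing time.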

Page 23: Naive application of Machine Learning to Software Development

Scoring: Mean squared error

sklearn.metrics.mean_squared_error

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)

test_error = metrics.mean_squared_error(
    times_test, times_test_predict)

Page 24: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

for number, created, ... in tickets_file:
    row = []
    created = dt.datetime.strptime(created, time_format)
    closetime = dt.datetime.strptime(closetime, time_format)
    time_to_fix = closetime - created
    # the only feature is the ticket number
    row.append(float(number))
    tickets.append(row)
    # the target is the time to fix, in seconds
    times.append(total_seconds(time_to_fix))
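The target value is simply the difference between two timestamps, expressed in seconds. A minimal Python 3 sketch; the creation time is taken from the sample CSV row for ticket 2, while the closing time is invented for illustration. `total_seconds` on the slide looks like a helper function, presumably standing in for `timedelta.total_seconds()` on older Python versions (an assumption).

```python
import datetime as dt

time_format = '%Y-%m-%d %H:%M:%S'

created = dt.datetime.strptime('2005-07-13 12:04:45', time_format)
# Hypothetical closing time, for illustration only.
closetime = dt.datetime.strptime('2005-07-15 12:04:45', time_format)

time_to_fix = closetime - created        # a datetime.timedelta
seconds = time_to_fix.total_seconds()    # the regression target
days = seconds / (24 * 3600)             # easier to read than seconds
```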

Page 25: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

import numpy as np
from sklearn import preprocessing

# standardize features to zero mean and unit variance
scaler = preprocessing.Scaler().fit(np.array(tickets))
tickets = scaler.transform(tickets)
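`preprocessing.Scaler` was later renamed `StandardScaler`. By default it subtracts each column's mean and divides by the column's standard deviation. For a single column, the idea can be sketched with the standard library; `standardize` is a hypothetical name and the ticket numbers are toy values.

```python
import statistics


def standardize(column):
    """Centre a column on zero and scale it to unit (population) std deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


ticket_numbers = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = standardize(ticket_numbers)
```

Scaling matters for SVMs because features with large raw magnitudes, such as ticket numbers in the tens of thousands, would otherwise dominate the kernel distances.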

Page 26: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

from sklearn.svm import SVR

# support vector regression with default parameters
clf = SVR()
clf.fit(tickets_train, times_train)

times_train_predict = clf.predict(tickets_train)
times_test_predict = clf.predict(tickets_test)

Page 27: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)
test_error = metrics.mean_squared_error(
    times_test, times_test_predict)

# root mean squared error, converted from seconds to days
print 'Train error: %.1f\n Test error: %.2f' % (
    math.sqrt(train_error) / (24 * 3600),
    math.sqrt(test_error) / (24 * 3600))
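The reported number is therefore a root mean squared error converted from seconds to days. A self-contained Python 3 sketch of that calculation; `rmse_in_days` is a hypothetical helper and the inputs are toy values.

```python
import math


def rmse_in_days(actual_seconds, predicted_seconds):
    """Root mean squared error of predictions in seconds, reported in days."""
    squared = [(a - p) ** 2
               for a, p in zip(actual_seconds, predicted_seconds)]
    mse = sum(squared) / len(squared)
    return math.sqrt(mse) / (24 * 3600)


day = 24 * 3600
actual = [10 * day, 20 * day, 30 * day]
predicted = [13 * day, 16 * day, 30 * day]   # off by 3, 4 and 0 days

error = rmse_in_days(actual, predicted)
```

Because the errors are squared before averaging, a few tickets that stay open for years can dominate the score.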

Page 28: Naive application of Machine Learning to Software Development

Fun #1: just ticket number?

Train error: 363.4

Test error: 361.41

Page 29: Naive application of Machine Learning to Software Development

Finding best parameters

SVM C controls regularization: a larger C leads to

● a closer fit to the training data

● with the risk of overfitting

Page 30: Naive application of Machine Learning to Software Development

Finding best parameters

# 10 values of C, evenly spaced on a log scale from 1e-1 to 1e10
Cs = np.logspace(-1, 10, 10)

for c in Cs:
    learn(c)
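`np.logspace(-1, 10, 10)` produces ten values whose exponents are evenly spaced between -1 and 10. A plain Python sketch of the same grid; the `logspace` function here is a stand-in written for illustration, not NumPy's implementation.

```python
def logspace(start, stop, num):
    """Return num values evenly spaced on a log10 scale, 10**start to 10**stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


Cs = logspace(-1, 10, 10)

for c in Cs:
    print('%.1f' % c)
```

A logarithmic grid is the usual choice for C because its useful values span many orders of magnitude.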

Page 31: Naive application of Machine Learning to Software Development

Finding best parameters

0.1: Train error: 363.4 Test error: 361.41
1.71: Train error: 363.4 Test error: 361.41
27.8: Train error: 363.4 Test error: 361.39
464.2: Train error: 363.2 Test error: 361.17
7742.6: Train error: 362.5 Test error: 360.41
129155.0: Train error: 362.1 Test error: 360.00
2154434.7: Train error: 362.0 Test error: 359.82
35938136.6: Train error: 361.7 Test error: 359.60
599484250.3: Train error: 361.5 Test error: 359.36
10000000000.0: Train error: 361.1 Test error: 358.91

Page 33: Naive application of Machine Learning to Software Development

Finding best parameters

sklearn.grid_search.GridSearchCV

bonus: it can run in parallel

# n_jobs=-1 uses all available CPU cores
clf = GridSearchCV(estimator=SVR(),
                   param_grid=dict(C=np.logspace(-1, 10, 10)),
                   n_jobs=-1)

clf.fit(tickets_train, times_train)

Train error: 361.1 Test error: 358.91

Best C: 1.0e+10
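In current scikit-learn releases `GridSearchCV` is imported from `sklearn.model_selection`; the `sklearn.grid_search` module named on the slide has since been removed. A minimal sketch on synthetic data, assuming NumPy and a recent scikit-learn are installed. The data, the grid bounds and the variable names are assumptions made for illustration, not the talk's dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: 60 samples, one feature, noisy linear target.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A smaller grid than the slide's, to keep the sketch fast.
param_grid = {'C': np.logspace(-1, 3, 5)}

# Cross-validates every C on the training set, then refits the best one.
clf = GridSearchCV(estimator=SVR(), param_grid=param_grid)
clf.fit(X_train, y_train)

best_C = clf.best_params_['C']
y_test_predict = clf.predict(X_test)
```

Passing `n_jobs=-1`, as on the slide, spreads the cross-validation fits over all CPU cores.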

Page 34: Naive application of Machine Learning to Software Development

Fun #2: creation date?

row = []
row.append(float(number))
# second feature: creation date as a Unix timestamp
row.append(float(time.mktime(created.timetuple())))
tickets.append(row)

Page 35: Naive application of Machine Learning to Software Development

Fun #2: creation date?

Train error: 360.6 Test error: 358.39

Best C: 1.0e+10

Page 36: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

from sklearn.feature_extraction.text import \
    CountVectorizer, TfidfTransformer

Page 37: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

reporters = []

for number, ... in tickets_file:
    [...]
    reporters.append(reporter)

Page 38: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

CountVectorizer().fit_transform(reporters) ->

TfidfTransformer().fit_transform( … ) ->

hstack((tickets, … ))

note: the TF-IDF matrix is sparse!

Page 39: Naive application of Machine Learning to Software Development

String vectorizer and Tfidf transform

import scipy.sparse as sp

# append the TF-IDF columns to the existing feature matrix
tickets = sp.hstack((
    tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(reporters))))

# remember to re-scale!
# with_mean=False: centering would make the sparse matrix dense
scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
tickets = scaler.transform(tickets)
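What the two transformers compute can be sketched without scikit-learn. `CountVectorizer` turns each string into token counts, and `TfidfTransformer` reweights them so that tokens appearing in many rows count for less. The sketch assumes the library's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. It tokenises on whitespace rather than with the library's regular expression, returns dense lists instead of a sparse matrix, and uses made-up reporter names.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows): one L2-normalised TF-IDF row per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(token for c in counts for token in c))
    n = len(documents)
    # Document frequency: in how many documents each token occurs.
    df = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


reporters = ['adrian', 'anonymous', 'anonymous', 'jacob']  # made-up sample
vocabulary, rows = tfidf(reporters)
```

In this sketch, a row holding a single token, such as a reporter name, normalises to a one-hot vector, so the weighting only starts to matter for multi-word fields such as summaries.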

Page 40: Naive application of Machine Learning to Software Development

Fun #3: reporter

Train error: 338.7 Test error: 353.38

Best C: 1.8e+07

Page 41: Naive application of Machine Learning to Software Development

Fun #4: subject

subjects = []

for number, created, ... in tickets_file:
    [...]
    subjects.append(summary)

[...]

# word n-grams of length 1 to 3 from the ticket summary
tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)
            ).fit_transform(subjects))))
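With `ngram_range=(1,3)` the vectorizer counts not only single words but also every run of two and three consecutive words. A small Python 3 sketch of that expansion; `word_ngrams` is a hypothetical helper that splits on whitespace, which is simpler than the library's tokenizer.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """List every run of min_n..max_n consecutive words in text."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(' '.join(words[start:start + n]))
    return grams


grams = word_ngrams('Calendar popup closes')
```

Three words already yield six features, so the vocabulary grows much faster than with single words alone.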

Page 43: Naive application of Machine Learning to Software Development

Fun #4: subject

Train error: 21.0 Test error: 331.79

Best C: 1.0e+10

Page 44: Naive application of Machine Learning to Software Development

Different SVM kernels

def learn(kernel='rbf', param_grid=None, verbose=False):
    [...]
    clf = GridSearchCV(
        estimator=SVR(kernel=kernel, verbose=verbose),
        param_grid=param_grid,
        n_jobs=-1)
    [...]

Page 45: Naive application of Machine Learning to Software Development

Different SVM kernels

RBF

Train error: 21.0 Test error: 331.79

Best C: 1.0e+10

Linear

Train error: 343.1 Test error: 355.56

Best C: 1.0e+02

Page 46: Naive application of Machine Learning to Software Development

Fun #5: account for the Component

components = []

for number, ..., component, ... in tickets_file:
    [...]
    components.append(component)

[...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(components))))

Page 47: Naive application of Machine Learning to Software Development

Fun #5: account for the Component

RBF

Train error: 18.9 Test error: 327.79

Best C: 1.0e+10

Linear

Train error: 342.2 Test error: 354.89

Best C: 1.0e+02

Page 48: Naive application of Machine Learning to Software Development

Fun #6: ticket Description

descriptions = []

for number, ..., description in tickets_file:
    [...]
    descriptions.append(description)

[...]

tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)
            ).fit_transform(descriptions))))

Page 49: Naive application of Machine Learning to Software Development

Fun #6: ticket Description

RBF

Train error: 10.8 Test error: 328.44

Best C: 1.0e+10

Linear

Train error: 14.0 Test error: 331.52

Best C: 3.2e+03

Page 52: Naive application of Machine Learning to Software Development

Conclusions

● All steps of a simple machine learning algorithm

● scikit-learn

● the data explicitly available in tickets is NOT ENOUGH to predict the closing date

Page 53: Naive application of Machine Learning to Software Development

Developers, what are you hiding?

:)

Page 54: Naive application of Machine Learning to Software Development

Questions?

Source code and dataset available at

https://github.com/42/django-trac-learning.git

Contacts:

● @akhavr

● http://42coffeecups.com/

