Date posted: 27-Jan-2015
Category: Technology
Uploaded by: andriy-khavryuchenko
Naive application of Machine Learning to Software Development
or... what developers don't tell :)
What and why
42 Coffee Cups: completely distributed development team
Hard facts about how software is done: LOTS OF THEM
Facts -> ??? -> Profit
???
Toy problem: get a ticket and predict how long it will take to close it
Bonus: learn scikit-learn :)
Install scikit-learn
● sudo apt-get install python-dev python-numpy python-numpy-dev python-scipy python-setuptools libatlas-dev g++
● pip install -U scikit-learn
Data: closed tickets
import urllib2

# Trac query: export the selected ticket columns as CSV
url = ('https://code.djangoproject.com/query?format=csv'
       '&col=id&col=time&col=changetime&col=reporter'
       '&col=summary&col=status&col=owner&col=type'
       '&col=component&order=priority')
tickets = urllib2.urlopen(url).read()
with open('2012-10-09.csv', 'w') as f:
    f.write(tickets)
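For readers on Python 3, where `urllib2` no longer exists, the same query URL can be assembled with `urllib.parse.urlencode`. This is only a sketch: the helper name `build_query_url` is made up for illustration, the column list mirrors the slide, and the download itself is left commented out because it needs network access.

```python
from urllib.parse import urlencode

BASE_URL = 'https://code.djangoproject.com/query'
COLUMNS = ['id', 'time', 'changetime', 'reporter', 'summary',
           'status', 'owner', 'type', 'component']

def build_query_url(base=BASE_URL, columns=COLUMNS):
    """Build the Trac CSV-export URL from the slide (hypothetical helper)."""
    params = [('format', 'csv')]
    params += [('col', c) for c in columns]   # one col= entry per column
    params.append(('order', 'priority'))
    return base + '?' + urlencode(params)

url = build_query_url()
print(url)

# To download (requires network access):
# from urllib.request import urlopen
# with open('2012-10-09.csv', 'wb') as f:
#     f.write(urlopen(url).read())
```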
Data: closed tickets
id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
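A quick way to check the export is to read it back with the standard `csv` module. The sketch below uses the two sample rows from the slide as an in-memory string (Python 3 syntax); a real run would open `2012-10-09.csv` instead.

```python
import csv
import io

# Two rows copied from the slide; a real run would read '2012-10-09.csv'.
SAMPLE = """id,time,changetime,reporter,summary,status,owner,type,component
1,2005-07-13 12:03:27,2012-05-20 08:12:37,adrian,Create architecture for anonymous sessions,closed,jacob,enhancement,Core (Other)
2,2005-07-13 12:04:45,2007-07-03 16:04:18,anonymous,Calendar popup - next/previous month links close the popup in Safari,closed,jacob,defect,contrib.admin
"""

# DictReader uses the header row as keys, so columns are addressed by name.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row['id'], row['reporter'], row['type'], row['component'])
```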
Data: closed date and description
# Python 2 (urllib2, urlparse), as in the rest of the talk
import datetime
import urllib2
import urlparse
from bs4 import BeautifulSoup

def get_data(ticket):
    url = 'https://code.djangoproject.com/ticket/%s' % ticket
    ticket_html = urllib2.urlopen(url)
    bs = BeautifulSoup(ticket_html)
    # get closing date
    d = bs.find_all('div', 'date')[0]
    p = list(d.children)[3]
    href = p.find('a')['href']
    # parse_qs sees the whole href as a query, hence the odd key
    close_time_str = urlparse.parse_qs(href)['/timeline?from'][0]
    # [:-6] drops the trailing timezone offset
    close_time = datetime.datetime.strptime(
        close_time_str[:-6], '%Y-%m-%dT%H:%M:%S')
    # ... more black magic, see code
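The timestamp extraction can be tried in isolation, without fetching a ticket page. The following Python 3 sketch makes two assumptions that are not in the slides: the helper name `parse_close_time` is hypothetical, and the example href is a guess at what a Trac timeline link looks like. It splits off the query string first, so the key is plain `from`.

```python
import datetime
from urllib.parse import urlsplit, parse_qs

def parse_close_time(href):
    """Extract a naive timestamp from a Trac timeline link (hypothetical helper)."""
    query = urlsplit(href).query
    stamp = parse_qs(query)['from'][0]
    # Drop the trailing 6-character UTC offset (e.g. '-07:00'), as the slide does.
    return datetime.datetime.strptime(stamp[:-6], '%Y-%m-%dT%H:%M:%S')

# Assumed link format; the real href comes from the ticket page.
example_href = '/timeline?from=2012-05-20T08%3A12%3A37-07%3A00&precision=second'
close_time = parse_close_time(example_href)
print(close_time)
```

Note that dropping the offset keeps the local wall-clock time and ignores the timezone, which matches what the slide code does.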
Data: closed date and description
def get_data(ticket):
    [...]
    # get description and return
    de = bs.find_all('div', 'description')[0]
    return close_time, de.text
Data: closed date and description
import csv

tickets_file = csv.reader(open('2012-10-09.csv'))
output = csv.writer(open('2012-10-09.close.csv', 'w'))
next(tickets_file)  # skip the CSV header row
for id, time, changetime, reporter, summary, \
        status, owner, type, component in tickets_file:
    closetime, descr = get_data(id)
    row = [id, time, changetime, closetime, reporter,
           summary, status, owner, type, component,
           descr.encode('utf-8')]
    output.writerow(row)
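Scoring: Train/Test set split
cross_validation.train_test_split
# sklearn.cross_validation was later replaced by sklearn.model_selection
from sklearn import cross_validation

(tickets_train, tickets_test, times_train, times_test) = \
    cross_validation.train_test_split(
        tickets, times,
        test_size=0.2,
        random_state=0)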
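To make clear what the split does, here is a hand-rolled equivalent using only the standard library (Python 3). It is an illustration, not scikit-learn's implementation: the function name `split_train_test` is hypothetical, and the shuffle will not reproduce the same partition that `random_state=0` gives in scikit-learn.

```python
import random

def split_train_test(features, targets, test_size=0.2, seed=0):
    """Shuffle the indices once and hold out a `test_size` fraction of them."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test
    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))

# Toy data: ten "tickets" with one feature each and a matching target.
tickets = [[float(i)] for i in range(10)]
times = [i * 100.0 for i in range(10)]
tickets_train, tickets_test, times_train, times_test = split_train_test(tickets, times)
print(len(tickets_train), len(tickets_test))
```

The key property is that features and targets are selected with the same indices, so each ticket stays paired with its own closing time.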
Scoring: Mean squared error
sklearn.metrics.mean_squared_error
from sklearn import metrics

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)
test_error = metrics.mean_squared_error(
    times_test, times_test_predict)
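Since the targets are durations in seconds, the raw mean squared error is in seconds squared and hard to read; the talk reports its square root divided by `24*3600`, that is, an RMSE in days. A plain-Python sketch of that arithmetic follows (the function names are illustrative and the numbers are made up).

```python
import math

SECONDS_PER_DAY = 24 * 3600

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_in_days(y_true, y_pred):
    """Root of the MSE, converted from seconds to days."""
    return math.sqrt(mean_squared_error(y_true, y_pred)) / SECONDS_PER_DAY

# Made-up durations in seconds: 1, 2 and 3 days; the "model" always says 2 days.
actual = [1 * SECONDS_PER_DAY, 2 * SECONDS_PER_DAY, 3 * SECONDS_PER_DAY]
predicted = [2 * SECONDS_PER_DAY] * 3
print('%.2f days' % rmse_in_days(actual, predicted))
```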
Fun #1: just ticket number?
for number, created, ... in tickets_file:
    row = []
    created = dt.datetime.strptime(created, time_format)
    closetime = dt.datetime.strptime(closetime, time_format)
    time_to_fix = closetime - created
    row.append(float(number))
    tickets.append(row)
    # total_seconds(): presumably a helper from the full source;
    # timedelta.total_seconds() is built in from Python 2.7
    times.append(total_seconds(time_to_fix))
Fun #1: just ticket number?
import numpy as np
from sklearn import preprocessing

# preprocessing.Scaler was renamed StandardScaler in later scikit-learn releases
scaler = preprocessing.Scaler().fit(np.array(tickets))
tickets = scaler.transform(tickets)
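What the scaler does to each feature column is simple: subtract the column mean and divide by the standard deviation, so that every feature ends up with zero mean and unit variance. A standard-library sketch for a single column (the name `standardize` is made up for this example):

```python
import statistics

def standardize(column):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)   # population std (divides by n, not n - 1)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

ticket_numbers = [1.0, 2.0, 3.0, 4.0]
scaled = standardize(ticket_numbers)
print(scaled)
```

This matters for SVMs with an RBF kernel, where features on very different scales (ticket numbers versus Unix timestamps, for example) would otherwise dominate the distance computation.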
Fun #1: just ticket number?
from sklearn.svm import SVR

clf = SVR()
clf.fit(tickets_train, times_train)
times_train_predict = clf.predict(tickets_train)
times_test_predict = clf.predict(tickets_test)
Fun #1: just ticket number?
import math

train_error = metrics.mean_squared_error(
    times_train, times_train_predict)
test_error = metrics.mean_squared_error(
    times_test, times_test_predict)
# root mean squared error, converted from seconds to days
print 'Train error: %.1f\n Test error: %.2f' % (
    math.sqrt(train_error)/(24*3600),
    math.sqrt(test_error)/(24*3600))
Fun #1: just ticket number?
Train error: 363.4
Test error: 361.41
Finding best parameters
SVM C controls regularization: larger C leads to
● closer fit to the train data
● with the risk of overfitting
Finding best parameters
Cs = np.logspace(-1, 10, 10)
for c in Cs:
    learn(c)
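`np.logspace(-1, 10, 10)` produces ten candidate values for C, evenly spaced on a logarithmic scale between 10^-1 and 10^10. A pure-Python sketch of the same grid, for readers who want to see the arithmetic (the local `logspace` function is a stand-in for illustration, not NumPy's implementation):

```python
def logspace(start, stop, num):
    """Return `num` values evenly spaced on a log10 scale from 10**start to 10**stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

Cs = logspace(-1, 10, 10)
for c in Cs:
    print('%.1f' % c)
```

Each value is a constant factor (about 16.7 here) larger than the previous one, which is the usual way to scan a regularization parameter whose useful range spans many orders of magnitude.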
Finding best parameters
0.1: Train error: 363.4 Test error: 361.41
1.71: Train error: 363.4 Test error: 361.41
27.8: Train error: 363.4 Test error: 361.39
464.2: Train error: 363.2 Test error: 361.17
7742.6: Train error: 362.5 Test error: 360.41
129155.0: Train error: 362.1 Test error: 360.00
2154434.7: Train error: 362.0 Test error: 359.82
35938136.6: Train error: 361.7 Test error: 359.60
599484250.3: Train error: 361.5 Test error: 359.36
10000000000.0: Train error: 361.1 Test error: 358.91
Finding best parameters
sklearn.grid_search.GridSearchCV
bonus: it can run in parallel
# sklearn.grid_search was later replaced by sklearn.model_selection
from sklearn.grid_search import GridSearchCV

clf = GridSearchCV(estimator=SVR(),
    param_grid=dict(C=np.logspace(-1, 10, 10)),
    n_jobs=-1)
clf.fit(tickets_train, times_train)

Train error: 361.1 Test error: 358.91
Best C: 1.0e+10
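The module names on these slides date from 2012. As a rough guide for current scikit-learn releases, the sketch below runs the same steps (split, scale, grid search over C, score) with today's imports. Everything about the data is an assumption: it is synthetic, in days rather than seconds, and not the Django tickets; the grid is also smaller than on the slide to keep the run short, and the scaler is fit on the training part only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the Django tickets): ticket number -> days to close.
rng = np.random.RandomState(0)
tickets = np.arange(1, 61, dtype=float).reshape(-1, 1)
times = rng.exponential(scale=30.0, size=60)

tickets_train, tickets_test, times_train, times_test = train_test_split(
    tickets, times, test_size=0.2, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(tickets_train)
tickets_train = scaler.transform(tickets_train)
tickets_test = scaler.transform(tickets_test)

# Smaller grid than on the slide, to keep the sketch fast.
param_grid = {'C': np.logspace(-1, 3, 5)}
# n_jobs=-1 would use all cores, as on the slide; 1 keeps the sketch simple.
clf = GridSearchCV(estimator=SVR(), param_grid=param_grid, n_jobs=1)
clf.fit(tickets_train, times_train)

train_rmse = np.sqrt(mean_squared_error(times_train, clf.predict(tickets_train)))
test_rmse = np.sqrt(mean_squared_error(times_test, clf.predict(tickets_test)))
print('Train error: %.1f  Test error: %.2f' % (train_rmse, test_rmse))
print('Best C: %.1e' % clf.best_params_['C'])
```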
Fun #2: creation date?
row = []
row.append(float(number))
row.append(float(time.mktime(
created.timetuple())))
tickets.append(row)
Fun #2: creation date?
Train error: 360.6 Test error: 358.39
Best C: 1.0e+10
String vectorizer and Tfidf transform
from sklearn.feature_extraction.text \
import CountVectorizer, \
TfidfTransformer
String vectorizer and Tfidf transform
reporters = []
for number, ... in tickets_file:
    [...]
    reporters.append(reporter)
String vectorizer and Tfidf transform
CountVectorizer().fit_transform(reporters) ->
TfidfTransformer().fit_transform( … ) ->
hstack((tickets, …))
note: TF-IDF matrix is sparse!
String vectorizer and Tfidf transform
import scipy.sparse as sp

tickets = sp.hstack((
    tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(reporters))))
# remember to re-scale!
scaler = preprocessing.Scaler(with_mean=False).fit(tickets)
tickets = scaler.transform(tickets)
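A small self-contained version of this step, written against current scikit-learn, may help to see the shapes involved. It is a sketch on toy data: the four reporters and ticket numbers are made up, `TfidfVectorizer` is used as a one-step replacement for `CountVectorizer` followed by `TfidfTransformer`, and `StandardScaler` stands in for the older `Scaler`.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data: one numeric column (ticket number) and one reporter per ticket.
numbers = np.array([[1.0], [2.0], [3.0], [4.0]])
reporters = ['adrian', 'anonymous', 'adrian', 'jacob']

# One column per distinct reporter name; the result is a sparse matrix.
reporter_tfidf = TfidfVectorizer().fit_transform(reporters)

# Stack numeric and text-derived columns side by side; the result stays sparse.
features = sp.hstack((sp.csr_matrix(numbers), reporter_tfidf), format='csr')

# with_mean=False: centering would turn the sparse matrix into a dense one.
features = StandardScaler(with_mean=False).fit_transform(features)
print(features.shape)
```

With three distinct reporters plus the numeric column, the stacked matrix has four columns; on the real data the reporter block is far wider, which is why keeping it sparse matters.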
Fun #3: reporter
Train error: 338.7 Test error: 353.38
Best C: 1.8e+07
Fun #3: subject
subjects = []
for number, created, ... in tickets_file:
    [...]
    subjects.append(summary)
[...]
tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)
            ).fit_transform(subjects))))
Fun #3: subject
Train error: 21.0 Test error: 331.79
Best C: 1.0e+10
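The `ngram_range=(1,3)` setting used for the subject features means that single words, word pairs and word triples all become separate columns. The sketch below imitates that in plain Python to show what gets extracted from one summary; it only approximates `CountVectorizer`'s default analyzer (lowercasing, words of two or more characters), and `word_ngrams` is a name invented for this example.

```python
import re

def word_ngrams(text, ngram_range=(1, 3)):
    """List word n-grams, roughly as CountVectorizer's default analyzer would."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())   # words of 2+ characters
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

grams = word_ngrams('Calendar popup closes')
print(grams)
```

Three words already yield six features, so thousands of ticket summaries produce a very wide matrix. That gives the model plenty of room to memorize the training set, which is consistent with the large gap between train and test error reported for this step.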
Different SVM kernels
def learn(kernel='rbf', param_grid=None, verbose=False):
    [...]
    clf = GridSearchCV(
        estimator=SVR(kernel=kernel, verbose=verbose),
        param_grid=param_grid,
        n_jobs=-1)
    [...]
Different SVM kernels
RBF
Train error: 21.0 Test error: 331.79
Best C: 1.0e+10
Linear
Train error: 343.1 Test error: 355.56
Best C: 1.0e+02
Fun #5: account for the Component
components = []
for number, ..., component, ... in tickets_file:
    [...]
    components.append(component)
[...]
tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer().fit_transform(components))))
Fun #5: account for the Component
RBF
Train error: 18.9 Test error: 327.79
Best C: 1.0e+10
Linear
Train error: 342.2 Test error: 354.89
Best C: 1.0e+02
Fun #6: ticket Description
descriptions = []
for number, ..., description in tickets_file:
    [...]
    descriptions.append(description)
[...]
tickets = sp.hstack((tickets,
    TfidfTransformer().fit_transform(
        CountVectorizer(ngram_range=(1, 3)
            ).fit_transform(descriptions))))
Fun #6: ticket Description
RBF
Train error: 10.8 Test error: 328.44
Best C: 1.0e+10
Linear
Train error: 14.0 Test error: 331.52
Best C: 3.2e+03
Conclusions
● All steps of a simple machine learning algorithm
● scikit-learn
● data explicitly available in tickets is NOT ENOUGH to predict the closing date
Developers, what are you hiding?
:)
Questions?
Source code and dataset available at
https://github.com/42/django-trac-learning.git
Contacts:
● @akhavr
● http://42coffeecups.com/