Date post: | 14-Apr-2018 |
Category: |
Documents |
Upload: | munish-mehra |
7/27/2019 Regression machine learning
Regression on Page Relevancy
CSE4/574 Machine Learning
TA: Zhen [email protected]
Web search ranking
Goal: given a query and a set of documents/URLs, estimate the relevance of the pages to the query (the Web search results).
Ranking the pages via a relevance function.
[Figure: a query and a set of URL pages are passed to the ranking function, which outputs a ranking result (e.g. 1, 2, 4).]
Regression on Page Relevancy
Not Ranking!!
Goal: Train a regression model on a dataset of query-URL pairs, then predict the page relevance labels for new queries.
Binary / multiple levels of relevance (Bad, Fair, Good, Excellent, Perfect, ...)
[Figure: a query and a set of URL pages are passed to the model, which outputs relevance levels (e.g. 3, 2, 4).]
Datasets
Large-scale real-world learning to rank (LTR) datasets that have been released:
Dataset           Queries  Doc.   Rel.  Feat.  Year
Letor3.0 Gov      575      568k   2     64     2008
Letor3.0 Ohsumed  106      16k    3     45     2008
Letor4.0          2476     85k    3     46     2009
Yandex            20267    213k   5     245    2009
Yahoo             36251    883k   5     700    2010
Letor4.0 Dataset
The latest version, 4.0, can be found at http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx
(It contains 8 datasets for four ranking settings derived from the two query sets and the Gov2 web page collection.)
LETOR is a package of benchmark data sets for research on Learning To Rank, released by Microsoft Research Asia.
For this project, one dataset from MQ2008 is used (supervised ranking):
Querylevelnorm.txt (15211 urls/samples in total)
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2008.rar
Letor4.0 Dataset
Sample rows from the MQ2008 dataset:
Judgments {0; 1; 2; 3; 4} (Bad, Fair, Good, Excellent, Perfect).
Letor4.0 Dataset
Sample rows from the MQ2008 dataset:
1. The first column is the relevance label of the pair. The larger the relevance label, the more relevant the query-document pair.
2. The second column is the query id.
3. The following 46 columns are features. A query-document pair is represented by a 46-dimensional feature vector of real numbers in the range 0 to 1.
4. The end of the row is a comment about the pair, including the id of the document.
Judgments {0; 1; 2}
Features
Given a query and a document, construct
a feature vector (normalized between 0 and 1)
Import Data Set
MATLAB functions: fopen, textscan, strfind, etc.
Read by line
File -> Import Data
>> line_string = importedData{1} % imported data is an nx1 cell
or
>> fid = fopen('dataset.txt');
>> data = textscan(fid, '%[^\n]'); % read by lines, data is a 1x1 cell
>> fclose(fid);
>> line_string = data{1}{1};
Example of a line as a string
Process Data Set (i)
LETOR 4.0
2 qid:10002 1:0.007477 2:0.000000 3:1.000000 4:0.000000 5:0.007470 ... 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622
Process the original data into a matrix containing relevance labels (the first column) and feature vectors. This input matrix (training data) will be fed into your regression model.
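The processing step above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB functions named on the slides; the helper names `parse_letor_line` and `to_row` are hypothetical, and filling features that are absent from the shortened sample line with 0.0 is an assumption made only so the example runs.

```python
def parse_letor_line(line):
    """Split one LETOR-style row into (label, qid, features, comment)."""
    # Everything after '#' is the comment (document id etc.).
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])                 # first column: relevance label
    qid = tokens[1].split(":", 1)[1]       # second column: "qid:<id>"
    features = {}
    for token in tokens[2:]:               # remaining columns: "<index>:<value>"
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features, comment.strip()


def to_row(label, features, num_features=46):
    """Build one matrix row: [label, feature_1, ..., feature_46]."""
    # Assumption: a feature missing from the line is stored as 0.0.
    return [float(label)] + [features.get(i, 0.0) for i in range(1, num_features + 1)]


# Shortened sample line (most of the 46 features are omitted here).
sample = ("2 qid:10002 1:0.007477 2:0.000000 3:1.000000 46:0.007042 "
          "#docid = GX008-86-4444840 inc = 1 prob = 0.086622")
label, qid, features, comment = parse_letor_line(sample)
row = to_row(label, features)
```

Stacking one such row per line gives the N x M input matrix, with the label in the first column and the 46 features in the remaining columns.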
Process Data Set (ii)
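[Figure: the N x M dataset matrix. The first column (1 dimension) holds the relevancy labels and the remaining M-1 columns hold the feature vectors, e.g. the row 2 | 0.3 0.45 0.12 0.89. The rows of the dataset are divided into train, validation, and test subsets.]
For LETOR 4.0, you need to partition the dataset into three subsets.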
Train/Validation/Test Sets
[Figure: the N x M dataset matrix (relevancy labels in 1 dimension, feature vectors in M-1 dimensions), with its rows divided into train, validation, and test subsets.]
Leave out as ground truth!
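A minimal sketch of the three-way partition, written in Python as an illustration. The 80/10/10 fractions, the fixed seed, and the function name `split_rows` are assumptions; the slides do not prescribe split sizes. Rows are shuffled individually here, which ignores query ids.

```python
import random


def split_rows(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the rows and cut them into train, validation, and test subsets."""
    rows = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)      # fixed seed keeps the split reproducible
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]          # whatever remains is held out for testing
    return train, val, test


# Toy rows in the [label, feature] layout described above.
rows = [[i % 3, i / 100.0] for i in range(100)]
train, val, test = split_rows(rows)
```

The test subset is only used once, for the final error report; all tuning decisions are made on the validation subset.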
Linear Regression
Problem: We want a general way of obtaining a linear model (a model that is linear in the parameters) that is fitted to observed data.
General set up:
Given a set of training examples $(\mathbf{x}_n, t_n)$, $n = 1, \ldots, N$
Goal: learn a function $y(\mathbf{x})$ to minimize some loss function (error function): $E(\mathbf{y}, \mathbf{t})$
Linear Basis function Model:
$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})
$$
Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias parameter.
In the simplest case, we use linear basis functions: $\phi_j(\mathbf{x}) = x_j$.
Linear Regression
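Model:
$$
y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})
$$
Data, targets, and weights:
$$
\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)^T, \quad
\mathbf{t} = (t_1, t_2, \ldots, t_N)^T, \quad
\mathbf{w} = (w_0, w_1, \ldots, w_{M-1})^T
$$
N x M design matrix (each row is a single data point, each column is a basis function):
$$
\boldsymbol{\Phi} =
\begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix}
$$
Squared Error function:
$$
E(\mathbf{w}) \equiv E(\mathbf{y}, \mathbf{t}) = (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})
$$
Minimize error:
$$
\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{y}, \mathbf{t})
$$
Estimation:
$$
\nabla_{\mathbf{w}} E(\mathbf{w}) = 0
$$
Least squares solution:
$$
\mathbf{w}^* = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}
$$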
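The closed-form least squares solution can be sketched in a few lines. This is an illustrative Python/NumPy version, not the project's required implementation; the function names and the toy data are assumptions. It uses the simplest basis, $\phi_0 = 1$ and $\phi_j(\mathbf{x}) = x_j$.

```python
import numpy as np


def design_matrix_linear(X):
    """Prepend a column of ones (phi_0 = 1) to the raw features (phi_j(x) = x_j)."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_least_squares(Phi, t):
    """Solve the normal equations (Phi^T Phi) w = Phi^T t for w."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)


# Toy data generated from t = 1 + 2x, so the fit should recover w = (1, 2).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
Phi = design_matrix_linear(X)
w = fit_least_squares(Phi, t)
```

Solving the linear system avoids forming the explicit inverse. If $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ is singular, `np.linalg.solve` raises an error; `np.linalg.lstsq` is one alternative in that case.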
Linear Basis Function Models
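$$
y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})
$$
Polynomial:
$$
\phi_j(x) = x^j
$$
Gaussian:
$$
\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
$$
Sigmoid:
$$
\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), \quad \sigma(a) = \frac{1}{1 + \exp(-a)}
$$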
Linear Regression for Project
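Project Goal: To predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables.
One dimensional: D = 1 (already encountered)
D dimensions, where $x_n^{(i)}$ is dimension i of sample n:
$$
\mathbf{X} =
\begin{pmatrix}
x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(D)} \\
x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(D)} \\
\vdots & \vdots & \ddots & \vdots \\
x_N^{(1)} & x_N^{(2)} & \cdots & x_N^{(D)}
\end{pmatrix}, \quad
\mathbf{t} =
\begin{pmatrix}
t_1 \\ t_2 \\ \vdots \\ t_N
\end{pmatrix}
$$
Find $\mathbf{w} = (w_0, w_1, \ldots)^T$?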
Linear Regression for Project
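Polynomial Basis Function (not required): $\phi_j(x) = x^j$
$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} \sum_{i=1}^{D} w_{ji} \, \phi_j(x^{(i)})
$$
The outer sum runs over the different orders of polynomial; the inner sum runs over the D dimensions.
$$
\boldsymbol{\Phi}(\mathbf{X}) =
\begin{pmatrix}
1 & x_1^{(1)} & \cdots & x_1^{(D)} & (x_1^{(1)})^2 & \cdots & (x_1^{(D)})^2 & \cdots & (x_1^{(1)})^{M-1} & \cdots & (x_1^{(D)})^{M-1} \\
\vdots & & & & & & & & & & \vdots \\
1 & x_N^{(1)} & \cdots & x_N^{(D)} & (x_N^{(1)})^2 & \cdots & (x_N^{(D)})^2 & \cdots & (x_N^{(1)})^{M-1} & \cdots & (x_N^{(D)})^{M-1}
\end{pmatrix}
$$
$\boldsymbol{\Phi}(\mathbf{X})$: N x ((M-1) x D + 1) matrix
w: ((M-1) x D + 1)-dimensional weight vector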
Linear Regression for Project
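Gaussian Basis Function:
$$
\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)
$$
$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} \sum_{i=1}^{D} w_{ji} \, \phi_j(x^{(i)})
$$
The outer sum runs over the different Gaussian parameter settings; the inner sum runs over the D dimensions.
$$
\boldsymbol{\Phi}(\mathbf{X}) =
\begin{pmatrix}
1 & \phi_1(x_1^{(1)}) & \cdots & \phi_1(x_1^{(D)}) & \phi_2(x_1^{(1)}) & \cdots & \phi_2(x_1^{(D)}) & \cdots & \phi_{M-1}(x_1^{(1)}) & \cdots & \phi_{M-1}(x_1^{(D)}) \\
\vdots & & & & & & & & & & \vdots \\
1 & \phi_1(x_N^{(1)}) & \cdots & \phi_1(x_N^{(D)}) & \phi_2(x_N^{(1)}) & \cdots & \phi_2(x_N^{(D)}) & \cdots & \phi_{M-1}(x_N^{(1)}) & \cdots & \phi_{M-1}(x_N^{(D)})
\end{pmatrix}
$$
$\boldsymbol{\Phi}(\mathbf{X})$: N x ((M-1) x D + 1) matrix
Sigmoid basis function: similar to Gaussian
w: ((M-1) x D + 1)-dimensional weight vector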
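The Gaussian design matrix described above can be built as in the sketch below. This is an illustrative Python/NumPy version under stated assumptions: each basis function $\phi_j$ has a scalar center `centers[j]` that is applied to every input dimension, and a single width `s` is shared by all basis functions. The function name and the toy inputs are hypothetical.

```python
import numpy as np


def gaussian_design_matrix(X, centers, s):
    """Return the N x ((M-1)*D + 1) matrix [1, phi_1(x^(1..D)), ..., phi_{M-1}(x^(1..D))]."""
    X = np.asarray(X, dtype=float)               # shape (N, D)
    centers = np.asarray(centers, dtype=float)   # shape (M-1,), one mu_j per basis function
    # diff[n, j, i] = x_n^(i) - mu_j
    diff = X[:, None, :] - centers[None, :, None]
    phi = np.exp(-(diff ** 2) / (2.0 * s ** 2))  # shape (N, M-1, D)
    # Flatten so that columns are grouped by basis function j, then by dimension i.
    flat = phi.reshape(X.shape[0], -1)
    bias = np.ones((X.shape[0], 1))              # phi_0 = 1 column
    return np.hstack([bias, flat])


# Toy input: N = 3 samples, D = 2 dimensions, M - 1 = 3 Gaussian settings.
X = np.array([[0.0, 0.5], [1.0, 0.5], [0.25, 0.75]])
centers = [0.0, 0.5, 1.0]
Phi = gaussian_design_matrix(X, centers, s=0.5)
```

With features normalized to the range 0 to 1, centers spread over that interval are a natural starting point; the number of centers and the width `s` are the parameters to tune on the validation set.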
Overfitting Issue
What can we do to curb overfitting?
Use a less complex model
Use more training examples
Regularization
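Regularized Least Squares
Add a regularization term to the error function to control over-fitting:
$$
E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})
$$
$E_D(\mathbf{w})$ is the data dependent term and $\lambda E_W(\mathbf{w})$ is the regularization term.
Squared Error function:
$$
E(\mathbf{w}) = \frac{1}{2} (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}
$$
The regularization term encourages small weight values!
Minimize error:
$$
\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})
$$
Regularized Least squares solution:
$$
\nabla_{\mathbf{w}} E(\mathbf{w}) = 0 \quad \Rightarrow \quad \mathbf{w}^* = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}
$$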
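A minimal sketch of the regularized least squares solution $\mathbf{w}^* = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$, in Python/NumPy for illustration. The function name and toy data are assumptions. Note that, like the formula, this version also shrinks the bias weight $w_0$.

```python
import numpy as np


def fit_regularized_least_squares(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t for w."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)


# Toy data from t = 1 + 2x with a bias column in Phi.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])

w_plain = fit_regularized_least_squares(Phi, t, lam=0.0)   # ordinary least squares
w_ridge = fit_regularized_least_squares(Phi, t, lam=10.0)  # weights are pulled toward zero
```

With `lam=0.0` the result matches the unregularized solution; increasing `lam` reduces the norm of the weight vector at the cost of a larger training error.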
Experimental Phases
Training: determine the format of your model, then train the model you have selected (learn the weights w).
Validation: if the validation error is unacceptable, adjust the following and train again:
# of basis functions,
regularization hyperparameter λ,
etc.
Test: evaluate the final model (the model with tuned parameters) and report the test error.
Optimal solution? Model complexity?
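The train/validate/test loop can be sketched as below. This is an illustrative Python/NumPy example on synthetic data, not results from LETOR; the candidate values of λ, the split sizes, and all function names are assumptions. Only the regularization hyperparameter is tuned here; the number of basis functions could be searched in the same way.

```python
import numpy as np


def fit_ridge(Phi, t, lam):
    """Regularized least squares: solve (lam * I + Phi^T Phi) w = Phi^T t."""
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)


def rms_error(Phi, t, w):
    """Root mean square error of the predictions Phi @ w against targets t."""
    return float(np.sqrt(np.mean((Phi @ w - t) ** 2)))


def select_lambda(Phi_train, t_train, Phi_val, t_val, lambdas):
    """Train once per candidate lambda and keep the one with the lowest validation error."""
    best = None
    for lam in lambdas:
        w = fit_ridge(Phi_train, t_train, lam)
        err = rms_error(Phi_val, t_val, w)
        if best is None or err < best[1]:
            best = (lam, err, w)
    return best


# Synthetic data (assumption): 60 samples, 3 features in [0, 1], noisy linear targets.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
Phi = np.hstack([np.ones((60, 1)), X])
t = Phi @ np.array([0.5, 1.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(60)

Phi_train, t_train = Phi[:40], t[:40]
Phi_val, t_val = Phi[40:50], t[40:50]
Phi_test, t_test = Phi[50:], t[50:]

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam, val_err, w = select_lambda(Phi_train, t_train, Phi_val, t_val, lambdas)
test_err = rms_error(Phi_test, t_test, w)   # reported once, after tuning is finished
```

The test subset plays no part in choosing λ, so the reported test error is an estimate of performance on unseen queries.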
Evaluation Metrics
Express results as Root Mean Square Error, $E_{RMS}$:
$$
E_{RMS}(\mathbf{w}) = \sqrt{\frac{2 E_D(\mathbf{w})}{N}}
$$
N: number of data points in the data set
$E_D(\mathbf{w})$: sum-of-squares error function (data-dependent error)
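A small worked example of the metric, in plain Python for illustration. It assumes the convention $E_D(\mathbf{w}) = \frac{1}{2} \sum_n (y_n - t_n)^2$, under which the factor 2 in the formula cancels the 1/2 and $E_{RMS}$ is the usual root mean square error. The function names and the toy numbers are assumptions.

```python
import math


def sum_of_squares_error(predictions, targets):
    """E_D = 1/2 * sum_n (y_n - t_n)^2 (the 1/2 factor is an assumed convention)."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))


def rms_error(predictions, targets):
    """E_RMS = sqrt(2 * E_D / N)."""
    n = len(targets)
    return math.sqrt(2.0 * sum_of_squares_error(predictions, targets) / n)


# Toy example: one prediction is off by 2, the other three are exact.
predictions = [1.0, 2.0, 2.0, 0.0]
targets = [1, 2, 0, 0]
e_rms = rms_error(predictions, targets)   # E_D = 2, so E_RMS = sqrt(2 * 2 / 4) = 1
```

Dividing by N makes errors comparable across the training, validation, and test subsets even though they differ in size.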
Project Report
Explain the problem and how you chose your model.
Elaborate on your validation process.
- The intuitive choice of parameters.
There is no limitation on setting parameters and there could be infinitely many choices. You can define some range or choose some specific values.
- Description of how you went about avoiding overfitting.
Generate graphs showing how the error changes as the parameters are adjusted.
Report the final result and evaluate the model's performance.