Date post: | 18-Dec-2015 |
Upload: | doris-morton |
Regression
• Same problem as classification except that the target variable yi is continuous.
• Popular solutions
– Linear regression (perceptron)
– Support vector regression
– Logistic regression (for regression)
Linear regression
• Suppose target values are generated by a function yi = f(xi) + ei
• We will estimate f(xi) by g(xi,θ).
• Suppose each ei is generated by a Gaussian distribution with mean 0 and variance σ² (the same variance for all ei).
• This implies that the probability of yi given the input xi and parameters θ, denoted p(yi|xi,θ), is normally distributed with mean g(xi,θ) and variance σ².
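The noise model above can be checked numerically. The following is a minimal sketch, assuming a linear g and illustrative values θ = (2, 1) and σ = 0.5 that do not come from the slides: it simulates yi = g(xi,θ) + ei and confirms that the residuals have mean near 0 and variance near σ².

```python
import math
import random

def g(x, theta):
    """Hypothetical linear model g(x, theta) = theta[0] * x + theta[1]."""
    return theta[0] * x + theta[1]

def conditional_density(y, x, theta, sigma):
    """Gaussian density p(y | x, theta): mean g(x, theta), variance sigma**2."""
    z = (y - g(x, theta)) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Simulate y_i = f(x_i) + e_i, taking f to be g with illustrative parameters.
rng = random.Random(0)
theta, sigma = (2.0, 1.0), 0.5
xs = [rng.uniform(-3.0, 3.0) for _ in range(5000)]
ys = [g(x, theta) + rng.gauss(0.0, sigma) for x in xs]

# Residuals y_i - g(x_i, theta) should have mean near 0 and variance near sigma**2.
residuals = [y - g(x, theta) for x, y in zip(xs, ys)]
mean_res = sum(residuals) / len(residuals)
var_res = sum((r - mean_res) ** 2 for r in residuals) / len(residuals)
```

The density is highest when y equals g(x,θ), which is why maximizing it pulls g toward the observed targets.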
Linear regression
• Apply maximum likelihood to estimate g(x,θ)
• Assume the pairs (xi,yi) are i.i.d.
• Then the probability of the data given the model (the likelihood) is P(X|θ) = p(x1,y1)p(x2,y2)…p(xn,yn)
• Each p(xi,yi) = p(yi|xi)p(xi)
• p(yi|xi) is normally distributed with mean g(xi,θ) and variance σ²
• Maximizing the log likelihood (as for classification) gives us least squares (linear regression)
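The step from maximum likelihood to least squares can be written out explicitly. This is a sketch under the slide's assumptions (i.i.d. pairs, Gaussian noise with a fixed σ²) plus the assumption that p(xi) does not depend on θ; the symbol L(θ) for the log likelihood is introduced here for convenience.

```latex
\begin{aligned}
L(\theta) &= \log \prod_{i=1}^{n} p(y_i \mid x_i, \theta)\, p(x_i) \\
          &= \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
             \exp\!\left( -\frac{(y_i - g(x_i,\theta))^2}{2\sigma^2} \right) \right]
             + \sum_{i=1}^{n} \log p(x_i) \\
          &= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - g(x_i,\theta)\bigr)^2
             - \frac{n}{2} \log(2\pi\sigma^2) + \sum_{i=1}^{n} \log p(x_i) \\[4pt]
\arg\max_{\theta} L(\theta) &= \arg\min_{\theta} \sum_{i=1}^{n} \bigl(y_i - g(x_i,\theta)\bigr)^2
\end{aligned}
```

Only the first term of the last expression for L(θ) depends on θ, and it enters with a negative sign, so maximizing the log likelihood is the same as minimizing the sum of squared errors.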
Logistic regression
• Similar to linear regression derivation
• Minimize sum of squares between predicted and actual value
• However
– predicted is given by sigmoid function and
– yi is constrained in the range [0,1]
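A minimal sketch of this idea, assuming a single feature and plain batch gradient descent (the slides do not say which optimizer is used): the prediction is sigmoid(w·x + w0) and the objective is the mean squared difference from targets in [0,1]. The data, learning rate, and epoch count are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mean_squared_error(w, w0, xs, ys):
    """Mean of (sigmoid(w*x + w0) - y)^2 over the data."""
    return sum((sigmoid(w * x + w0) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_sigmoid_least_squares(xs, ys, lr=0.5, epochs=5000):
    """Minimize the mean squared error by batch gradient descent."""
    w, w0 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_w0 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + w0)
            # Chain rule: d/dz (p - y)^2 = 2 * (p - y) * p * (1 - p)
            common = 2.0 * (p - y) * p * (1.0 - p)
            grad_w += common * x
            grad_w0 += common
        w -= lr * grad_w / n
        w0 -= lr * grad_w0 / n
    return w, w0

# Illustrative noise-free targets in [0, 1], generated with w = 1.5, w0 = -0.5.
xs = [i / 10.0 for i in range(-30, 31)]
ys = [sigmoid(1.5 * x - 0.5) for x in xs]

initial_mse = mean_squared_error(0.0, 0.0, xs, ys)
w, w0 = fit_sigmoid_least_squares(xs, ys)
final_mse = mean_squared_error(w, w0, xs, ys)
```

Unlike linear least squares, this objective is not convex in (w, w0), so gradient descent is only guaranteed to find a local minimum.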
Support vector regression
• Makes no assumptions about probability distribution of the data and output (like support vector machine).
• Change the loss function in the support vector machine problem to the ε-insensitive loss to obtain support vector regression
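The loss that replaces the SVM hinge loss ignores errors smaller than a tolerance ε and grows linearly beyond it. A minimal sketch, with an illustrative default of ε = 0.1 and a squared loss shown only for comparison:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return max(0, |y_true - y_pred| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def squared_loss(y_true, y_pred):
    """Squared error, as used in least squares regression."""
    return (y_true - y_pred) ** 2

# Errors inside the epsilon tube cost nothing; larger errors cost |error| - epsilon.
inside_tube = epsilon_insensitive_loss(1.0, 1.05)
outside_tube = epsilon_insensitive_loss(1.0, 1.5)
```

Because the penalty is linear rather than quadratic, a single large residual influences the fit less than it would under squared loss.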
Support vector regression
• Solved by applying Lagrange multipliers, as in SVM
• Solution w is given by a linear combination of support vectors (as in SVM)
• The solution w can also be used for ranking features.
• From regularized risk minimization the loss would be

$$\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\ \left|y_i - (w^T x_i + w_0)\right| - \epsilon\right)$$
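As a rough illustration of fitting a linear SVR and using w to rank features, the sketch below minimizes the average ε-insensitive loss plus a small L2 penalty by subgradient descent on the primal problem. This is a swapped-in technique chosen for brevity, not the Lagrange-multiplier solution the slide describes. The penalty weight `lam`, the step size, and the synthetic data are all assumptions.

```python
import random

def fit_linear_svr(X, y, epsilon=0.1, lam=0.001, lr=0.02, epochs=2000):
    """Subgradient descent on lam*||w||^2 + mean(max(0, |y - (w.x + w0)| - epsilon))."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    w0 = 0.0
    for _ in range(epochs):
        grad_w = [2.0 * lam * wj for wj in w]  # gradient of the L2 penalty
        grad_w0 = 0.0
        for xi, yi in zip(X, y):
            r = yi - (sum(wj * xj for wj, xj in zip(w, xi)) + w0)
            if abs(r) > epsilon:
                # Outside the tube: subgradient of |r| w.r.t. w is -sign(r) * x_i
                s = 1.0 if r > 0 else -1.0
                for j in range(d):
                    grad_w[j] -= s * xi[j] / n
                grad_w0 -= s / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        w0 -= lr * grad_w0
    return w, w0

# Synthetic data: three features with values 0, 1, 2; only features 0 and 2 matter.
rng = random.Random(0)
X = [[rng.choice([0, 1, 2]) for _ in range(3)] for _ in range(200)]
y = [3.0 * x[0] - 1.0 * x[2] + 0.5 + rng.gauss(0.0, 0.05) for x in X]

w, w0 = fit_linear_svr(X, y)
# Rank features by decreasing absolute weight.
ranking = sorted(range(len(w)), key=lambda j: -abs(w[j]))
```

Ranking by |w| is only meaningful when the features share a common scale, as they do here.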
Application
• Prediction of continuous phenotypes in mice from genotype (Predicting unobserved phen…)
• Data are vectors xi where each feature takes on values 0, 1, and 2 to denote number of alleles of a particular single nucleotide polymorphism (SNP)
• Data has about 1500 samples and 12,000 SNPs
• Output yi is a phenotype value. For example: coat color (represented by integers), chemical levels in blood
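The 0/1/2 encoding can be sketched as follows. The two-letter genotype strings, the choice of which allele to count at each SNP, and the helper name `encode_snp` are hypothetical; the slides do not describe the raw data format.

```python
def encode_snp(genotype, counted_allele):
    """Count copies of counted_allele in a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected a two-letter genotype, got %r" % genotype)
    return sum(1 for allele in genotype if allele == counted_allele)

# Three samples (rows) genotyped at two SNPs (columns); values are made up.
samples = [
    ["AA", "CT"],
    ["AG", "TT"],
    ["GG", "CC"],
]
counted = ["G", "T"]  # allele counted at each SNP (an assumption)

# Each row of X is a feature vector x_i with entries in {0, 1, 2}.
X = [[encode_snp(g, a) for g, a in zip(row, counted)] for row in samples]
```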
Mouse phenotype prediction from genotype
• Rank SNPs by Wald test
– First perform linear regression y = wx + w0
– Calculate p-value on w using t-test
  • t-test: (w-wnull)/stderr(w)
  • wnull = 0
  • t-test: w/stderr(w)
  • stderr(w) given by sqrt( [Σi(yi-wxi-w0)² / (n-2)] / Σi(xi-mean(x))² )
– Rank SNPs by p-values
– OR by Σi(yi-wxi-w0)²
• Rank SNPs by Pearson correlation coefficient
• Rank SNPs by support vector regression (w vector in SVR)
• Rank SNPs by ridge regression (w vector)
• Run SVR and ridge regression on the top k ranked SNPs under cross-validation.
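The Wald-test and Pearson rankings can be sketched in a few lines. This is a minimal illustration on made-up data, assuming every SNP column is non-constant and no fit is perfect, since either case would cause a division by zero. It ranks by |t| rather than by p-value: the Python standard library has no t-distribution CDF, and with the same n for every SNP a larger |t| corresponds to a smaller two-sided p-value, so the order is the same.

```python
import math

def fit_line(xs, ys):
    """Least squares fit of y = w*x + w0 for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, my - w * mx

def wald_t(xs, ys):
    """t-statistic w / stderr(w) for the null hypothesis w = 0."""
    n = len(xs)
    w, w0 = fit_line(xs, ys)
    mx = sum(xs) / n
    rss = sum((y - w * x - w0) ** 2 for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    stderr = math.sqrt(rss / (n - 2) / sxx)
    return w / stderr

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rank_snps(X, y, score):
    """Return SNP column indices sorted by decreasing |score(column, y)|."""
    d = len(X[0])
    columns = [[row[j] for row in X] for j in range(d)]
    scores = [abs(score(col, y)) for col in columns]
    return sorted(range(d), key=lambda j: -scores[j])

# Made-up data: 8 samples, 3 SNPs coded 0/1/2, one continuous phenotype.
X = [[0, 2, 0], [0, 0, 1], [1, 1, 0], [1, 2, 1],
     [1, 0, 2], [2, 1, 1], [2, 0, 2], [2, 2, 2]]
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]

by_wald = rank_snps(X, y, wald_t)
by_pearson = rank_snps(X, y, pearson_r)
```

For single-feature regression the two scores are tied together by t = r·sqrt(n-2)/sqrt(1-r²), so ranking by |t| and ranking by |r| give the same order; the two criteria differ only in how a cutoff would be chosen.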
Rice phenotype prediction from genotype
• Same experimental study as previously
• Improving the Accuracy of Whole Genome Prediction for Complex Traits Using the Results of Genome Wide Association Studies
• Data has 413 samples and 37,000 SNPs (features)
• Best linear unbiased prediction (BLUP) method improved by prior SNP knowledge (given in genome-wide association studies)