Page 1: Transfer Learning with Applications  to Text Classification

Transfer Learning with Applications to Text Classification

Jing Peng
Computer Science Department

Page 2: Transfer Learning with Applications  to Text Classification

Machine learning: the study of algorithms that

① improve performance P
② on some task T
③ using experience E

Well-defined learning task: <P, T, E>

Page 3: Transfer Learning with Applications  to Text Classification

Learning to recognize targets in images:

Page 4: Transfer Learning with Applications  to Text Classification

Learning to classify text documents:

Page 5: Transfer Learning with Applications  to Text Classification

Learning to build forecasting models:

Page 6: Transfer Learning with Applications  to Text Classification

Growth of Machine Learning

Machine learning is the preferred approach to:

① Speech processing
② Computer vision
③ Medical diagnosis
④ Robot control
⑤ News article processing
⑥ …

This machine learning niche is growing because of:

① Improved machine learning algorithms
② Lots of data available
③ Software too complex to code by hand
④ …

Page 7: Transfer Learning with Applications  to Text Classification

Learning: given training data $z = \{(x_i, y_i)\}_{i=1}^{m}$, with $y_i = f^*(x_i) + \epsilon_i$, where $f^*$ represents the target.

Least squares methods: learning focuses on minimizing the empirical risk

$$f_z = \arg\min_{f \in H} \frac{1}{m} \sum_{i=1}^{m} \big(y_i - f(x_i)\big)^2$$

Approximation error of the hypothesis space $H$:

$$\int_X \big(f^* - f_H\big)^2\, dx, \qquad f_H = \arg\min_{f \in H} \int_X \big(f^* - f\big)^2\, dx$$

Estimation error $S(z, H)$: the extra error incurred by learning $f_z$ from the finite sample $z$ rather than using the best $f_H \in H$.

The total error decomposes into estimation error plus approximation error:

$$\int_X \big(f^* - f_z\big)^2\, dx = S(z, H) + \int_X \big(f^* - f_H\big)^2\, dx$$
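To make the objective concrete, here is a minimal numpy sketch of empirical risk minimization over a linear hypothesis space H; the target function and data are illustrative inventions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target f* and noisy samples y_i = f*(x_i) + eps_i
m = 50
x = rng.uniform(-1.0, 1.0, size=m)
y = (1.5 * x - 0.5) + 0.1 * rng.standard_normal(m)

# Hypothesis space H: linear functions f(x) = w0 + w1 * x.
# Least squares returns f_z = argmin_{f in H} (1/m) sum_i (y_i - f(x_i))^2.
A = np.column_stack([np.ones(m), x])        # design matrix
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes the empirical risk

empirical_risk = np.mean((y - A @ w) ** 2)
print(f"f_z(x) = {w[0]:.3f} + {w[1]:.3f} x,  empirical risk = {empirical_risk:.4f}")
```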

Page 8: Transfer Learning with Applications  to Text Classification

Main challenges:
1. Transfer learning
2. High dimensionality (more than 4,000 features)
3. Overlapping feature sets (fewer than 80% of features are shared)
4. A solution with performance bounds

Transfer Learning with Applications to Text Classification

Page 9: Transfer Learning with Applications  to Text Classification

Standard Supervised Learning

Training (labeled) and test (unlabeled) documents both come from the New York Times: classifier accuracy 85.5%.

Page 10: Transfer Learning with Applications  to Text Classification

In Reality…

Labeled New York Times data is not available! Training (labeled) documents come from Reuters; test (unlabeled) documents come from the New York Times: classifier accuracy drops to 64.1%.

Page 11: Transfer Learning with Applications  to Text Classification

Domain Difference → Performance Drop

  train            test              accuracy   setting
  New York Times   New York Times    85.5%      ideal
  Reuters          New York Times    64.1%      realistic

Page 12: Transfer Learning with Applications  to Text Classification

High Dimensional Data Transfer

High dimensional data: text categorization, image classification. The number of features in our experiments is more than 4,000.

Challenges of high dimensionality:
- more features than training examples
- Euclidean distance becomes meaningless

Page 13: Transfer Learning with Applications  to Text Classification

Why Dimension Reduction?

[Figure: nearest (DMIN) and farthest (DMAX) neighbor distances.]

Page 15: Transfer Learning with Applications  to Text Classification

Curse of Dimensionality

[Figure: the DMAX/DMIN ratio as the number of dimensions grows; in high dimensions the nearest and farthest neighbors become nearly indistinguishable.]
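The effect in the figure is easy to reproduce. A small sketch (sample size and dimensions are arbitrary choices of mine) that prints DMAX/DMIN for points drawn uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(0)

# As the dimension grows, the farthest-to-nearest distance ratio DMAX/DMIN
# shrinks toward 1 and Euclidean nearest neighbors lose their meaning.
for d in [2, 10, 100, 1000, 4000]:
    X = rng.random((200, d))
    # pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.nan)
    ratio = np.sqrt(np.nanmax(d2) / np.nanmin(np.abs(d2)))
    print(f"d = {d:5d}   DMAX/DMIN = {ratio:8.2f}")
```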

Page 16: Transfer Learning with Applications  to Text Classification

High dimensional data transfer (recap): text categorization and image classification, with more than 4,000 features; more features than training examples, so Euclidean distance becomes meaningless.

Are the feature sets completely overlapping? No: fewer than 80% of the features are the same.
Are the domains marginally not so related? Then it is harder to find transferable structures, and a proper similarity definition is needed.

Page 17: Transfer Learning with Applications  to Text Classification

PAC (Probably Approximately Correct) learning requirement:

The training and test distributions must be the same.


Page 21: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

• Overlapping distributions: data from two domains may not lie in exactly the same space, but at most in an overlapping one.

      x     y    z     label
  A   ?     1    0.2   +1
  B   0.09  ?    0.1   +1
  C   0.01  ?    0.3   -1

Page 26: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

Problems with overlapping distributions: the overlapping features alone may not provide sufficient predictive power.

      f1    f2   f3    label
  A   ?     1    0.2   +1
  B   0.09  ?    0.1   +1
  C   0.01  ?    0.3   -1

Hard to predict correctly.

Page 29: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

Overlapping distributions: use the union of all features and fill in the missing values with "zeros"?

      f1    f2   f3    label
  A   0     1    0.2   +1
  B   0.09  0    0.1   +1
  C   0.01  0    0.3   -1

Does it help?

Page 32: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

With zero filling, D²(A, B) = 0.0181 > D²(A, C) = 0.0101: A is misclassified into the class of C instead of B.
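A sketch of the A/B/C example with zero-filled feature unions. The slide apparently reports only the contribution of the non-shared coordinates (0.0181 and 0.0101); the full distances include the shared f2 term, but the ordering, and hence the misclassification, is the same:

```python
import numpy as np

# Zero-filled union (f1, f2, f3) from the example slides.
A = np.array([0.00, 1.0, 0.2])   # label +1; f1 was missing, filled with 0
B = np.array([0.09, 0.0, 0.1])   # label +1; f2 was missing, filled with 0
C = np.array([0.01, 0.0, 0.3])   # label -1; f2 was missing, filled with 0

d2_AB = float(np.sum((A - B) ** 2))   # 1.0181
d2_AC = float(np.sum((A - C) ** 2))   # 1.0101

# Nearest neighbor on the zero-filled vectors puts A with C (label -1),
# although A's true label matches B's (+1): the zeros standing in for
# missing values dominate the comparison.
assert d2_AC < d2_AB
print(f"D2(A,B) = {d2_AB:.4f} > D2(A,C) = {d2_AC:.4f}  ->  A assigned to C's class")
```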

Page 33: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

When one uses the union of overlapping and non-overlapping features and replaces missing values with "zero", the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features: they become the dominant factor in the similarity measure.
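A quick synthetic sketch of that asymptotic claim: keep the shared features nearly identical across domains and watch the zero-filled, non-overlapping features take over the distance (all numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_shared = 50

for n_private in [0, 100, 1000, 10000]:
    shared_a = rng.random(n_shared)
    shared_b = shared_a + 0.01 * rng.standard_normal(n_shared)  # nearly equal
    priv_a = rng.random(n_private)   # features only domain A observes
    priv_b = rng.random(n_private)   # features only domain B observes
    # union representation: the other domain's private block is zero-filled
    a = np.concatenate([shared_a, priv_a, np.zeros(n_private)])
    b = np.concatenate([shared_b, np.zeros(n_private), priv_b])
    print(f"non-overlapping dims = {n_private:5d}   "
          f"squared distance = {np.sum((a - b) ** 2):10.2f}")
```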

Page 34: Transfer Learning with Applications  to Text Classification

High dimensionality can undermine important features

Transfer between high dimensional overlapping distributions

Page 36: Transfer Learning with Applications  to Text Classification

Transfer between high dimensional overlapping distributions

The "blues" are closer to the "greens" than to the "reds".

Page 37: Transfer Learning with Applications  to Text Classification

LatentMap: two-step correction (see the sketch after this list)

1. Missing value regression
   - brings the marginal distributions closer
2. Latent space dimensionality reduction
   - further brings the marginal distributions closer
   - ignores unimportant, noisy, and "error-imported" features
   - identifies transferable substructures across the two domains
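A compact sketch of the two-step pipeline under simplifying assumptions: dense matrices, ridge regression for the imputation, and plain SVD for the latent space. The names (`latentmap_sketch`, `ridge_fill`) and parameter choices are mine, not the paper's; `shared`, `out_only`, and `in_only` are integer column indices into the union feature space:

```python
import numpy as np

def latentmap_sketch(X_out, X_in, shared, out_only, in_only, k=50, lam=1.0):
    """Two-step correction sketch: (1) regress each domain's missing features
    from the shared ones, (2) reduce the stacked matrix to k latent dims."""

    def ridge_fill(X_src, X_dst, cols):
        # Learn shared -> cols on the domain that observes 'cols',
        # then impute those columns for the other domain.
        Z, T = X_src[:, shared], X_src[:, cols]
        W = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ T)
        X_dst = X_dst.copy()
        X_dst[:, cols] = X_dst[:, shared] @ W
        return X_dst

    # Step 1: missing value regression brings the marginals closer.
    X_in_filled = ridge_fill(X_out, X_in, out_only)   # features seen only out-domain
    X_out_filled = ridge_fill(X_in, X_out, in_only)   # features seen only in-domain

    # Step 2: latent space dimensionality reduction on the stacked matrix.
    X = np.vstack([X_out_filled, X_in_filled])
    U, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    Z = U[:, :k] * s[:k]                              # low-dimensional rows
    return Z[: len(X_out)], Z[len(X_out):]
```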

Page 44: Transfer Learning with Applications  to Text Classification

Missing Value Regression

Predict missing values (recall the previous example):

1. Project to the overlapped feature.
2. Map from z to x, with the relationship found by regression.

After imputation, D(img(A'), B) = 0.0109 < D(img(A'), C) = 0.0125: A is correctly classified, in the same class as B.
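A minimal numpy sketch of the two steps on synthetic vectors; the slide's distances (0.0109 and 0.0125) come from the paper's real data, so the numbers below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: domain 1 observes (z, x); domain 2 observes only
# the overlapped features z, so its x-block is missing.
n, dz, dx = 100, 5, 3
Z1 = rng.random((n, dz))
W_true = rng.random((dz, dx))
X1 = Z1 @ W_true + 0.01 * rng.standard_normal((n, dx))
Z2 = rng.random((20, dz))

# Step 1: project to the overlapped features (the z columns).
# Step 2: map from z to x with a least squares fit on domain 1.
W, *_ = np.linalg.lstsq(Z1, X1, rcond=None)
X2_hat = Z2 @ W                      # imputed missing values for domain 2

rmse = np.sqrt(np.mean((X2_hat - Z2 @ W_true) ** 2))
print(f"imputation RMSE against the true map: {rmse:.4f}")
```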

Page 49: Transfer Learning with Applications  to Text Classification

Dimensionality Reduction

[Figure: the word vector matrix X is built by stacking the out-domain and in-domain word vectors; the two blocks share the overlapping features, and each block's missing values are filled by the regression step.]

Page 52: Transfer Learning with Applications  to Text Classification

Dimensionality Reduction

• Project the word vector matrix to its most important, inherent sub-space. With the truncated SVD $X_{d \times t} \approx U_k \Sigma_k V_k^{T}$, the low-dimensional representation is $V_k^{T} = \Sigma_k^{-1} U_k^{T} X$.
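A sketch of the projection, reading the slide's notation as X being a d×t (features by documents) matrix; the sizes below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, k = 4000, 300, 50                  # features x documents, latent dims
X = rng.random((d, t))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Low-dimensional representation: V_k^T = Sigma_k^{-1} U_k^T X  (a k x t matrix)
low_dim = (U[:, :k].T @ X) / s[:k, None]
print(low_dim.shape)                     # (50, 300)

# Sanity check: this equals the top-k right singular vectors.
assert np.allclose(low_dim, Vt[:k])
```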

Page 57: Transfer Learning with Applications  to Text Classification

Solution (high dimensionality)

Recall the previous example: before the correction, the blues were closer to the greens than to the reds; after it, the blues are closer to the reds than to the greens.

Page 58: Transfer Learning with Applications  to Text Classification

Properties

1. It brings the marginal distributions of the two domains closer.
   - The marginal distributions are brought closer in the high-dimensional space (Section 3.2).
   - The two marginal distributions are further minimized in the low-dimensional space (Theorem 3.2).
2. It brings the two domains' conditional distributions closer.
   - Nearby instances from the two domains have similar conditional distributions (Section 3.3).
3. It can reduce the domain transfer risk.
   - The risk of the nearest neighbor classifier can be bounded in the transfer learning setting (Theorem 3.3).

Page 59: Transfer Learning with Applications  to Text Classification

Experiment (I)

Data Sets
- 20 Newsgroups: 20,000 newsgroup articles
- SRAA (Simulated Real Auto Aviation): 73,128 articles from 4 discussion groups (simulated auto racing, simulated aviation, real autos, and real aviation)
- Reuters-21578: 21,578 Reuters news articles (1987)

Page 60: Transfer Learning with Applications  to Text Classification

Experiment (I)

First fill in the "GAP", then use a kNN classifier to do the classification.

[Figure: 20 Newsgroups hierarchy: comp (comp.sys, comp.graphics) and rec (rec.sport, rec.auto); one sub-category from each top-level category forms the Out-Domain, the other the In-Domain.]
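A minimal sketch of the classification flow with scikit-learn's kNN, assuming the gap filling and projection have already produced low-dimensional arrays (the shapes and names are mine):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins for the projected representations: labeled out-domain rows and
# unlabeled in-domain rows that we want to classify.
Z_out = rng.standard_normal((500, 50))
y_out = rng.integers(0, 2, size=500)
Z_in = rng.standard_normal((200, 50))

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Z_out, y_out)          # train on the (labeled) out-domain
pred = knn.predict(Z_in)       # label the in-domain documents
print(pred[:10])
```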

Page 61: Transfer Learning with Applications  to Text Classification

Experiment (I)

Baseline methods:
- naïve Bayes, logistic regression, SVMs
- Knn-Reg: missing values filled, but without SVD
- pLatentMap: SVD, but with missing values left as 0

The two ablation baselines justify the two steps in our framework.

Page 63: Transfer Learning with Applications  to Text Classification

Learning Tasks

Page 64: Transfer Learning with Applications  to Text Classification

Experiment (II)

Overall performance: 10 wins, 1 loss.

Page 65: Transfer Learning with Applications  to Text Classification

Experiment (III)

knnReg: missing values filled, but without SVD. Compared with knnReg: 8 wins, 3 losses.

pLatentMap: SVD, but without filling missing values. Compared with pLatentMap: 8 wins, 3 losses.

Page 66: Transfer Learning with Applications  to Text Classification

Conclusion

Problem: high dimensional, overlapping domain transfer (text and image categorization).

Step 1: fill in the missing values.
--- Brings the two domains' marginal distributions closer.

Step 2: SVD dimension reduction.
--- Further brings the two marginal distributions closer (Theorem 3.2).
--- Clusters points from the two domains, making the conditional distribution transferable (Theorem 3.3).
