Page 1: Imitation Learning from Imperfect Demonstration

Yueh-Hua Wu1,2, Nontawat Charoenphakdee3,2, Han Bao3,2, Voot Tangkaratt2, Masashi Sugiyama2,3

1 National Taiwan University

2 RIKEN Center for Advanced Intelligence Project

3 The University of Tokyo

Poster #47


Page 2: Introduction

Imitation learning

learning from demonstration instead of a reward function

Demonstration

a set of decisions (state-action pairs $x$)

Collected demonstrations may be imperfect

Driving: traffic violation

Playing basketball: technical foul


Page 3: Motivation

Confidence: how optimal a state-action pair $x$ is (a value between 0 and 1)

A semi-supervised setting: demonstration partially equipped with confidence

How can confidence be obtained?

Crowdsourcing: $N^{(1)}/(N^{(1)} + N^{(0)})$, the fraction of annotators who judge $x$ optimal

Digitized score: $0.0, 0.1, 0.2, \ldots, 1.0$
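As a minimal sketch of the crowdsourcing option, the confidence of each pair can be taken as the empirical fraction of positive votes; the vote counts below are made-up illustration data, not from the paper.

```python
# Crowdsourced confidence: each state-action pair x receives N(1)
# "optimal" and N(0) "non-optimal" votes, and its confidence is the
# empirical fraction N(1) / (N(1) + N(0)).
votes = [(9, 1), (5, 5), (2, 8)]  # (n_optimal, n_non_optimal) per pair

confidences = [n1 / (n1 + n0) for n1, n0 in votes]
print(confidences)  # [0.9, 0.5, 0.2]
```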


Page 4: Generative Adversarial Imitation Learning [1]

One-to-one correspondence between the policy $\pi$ and the distribution of demonstration [2]

Utilize generative adversarial training

$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x \sim p_{\mathrm{opt}}}[\log(1 - D_w(x))]$$

$D_w$: discriminator; $p_{\mathrm{opt}}$: demonstration distribution of $\pi_{\mathrm{opt}}$; $p_\theta$: trajectory distribution of the agent $\pi_\theta$
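A minimal PyTorch sketch of this minimax objective, not the authors' implementation; the input dimension (8) and network sizes are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

# Discriminator D_w: maps a state-action pair x to a probability.
# The input dimension of 8 and hidden width of 64 are placeholders.
D = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1), nn.Sigmoid())

def discriminator_loss(x_agent: torch.Tensor, x_demo: torch.Tensor) -> torch.Tensor:
    """max_w E_{x~p_theta}[log D_w(x)] + E_{x~p_opt}[log(1 - D_w(x))],
    negated so it can be minimized with a standard optimizer."""
    eps = 1e-8  # guards against log(0)
    return -(torch.log(D(x_agent) + eps).mean()
             + torch.log(1.0 - D(x_demo) + eps).mean())
```

The policy step then treats $-\log D_w(x)$ as a reward and updates $\pi_\theta$ with a policy-gradient method such as TRPO, which is how the outer minimization over $\theta$ is carried out in GAIL [1].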


Page 5: Problem Setting

Humans switch to non-optimal policies when they make mistakes or are distracted, so the collected demonstration follows a mixture:

$$p(x) = \alpha \underbrace{p(x \mid y = +1)}_{p_{\mathrm{opt}}(x)} + (1 - \alpha) \underbrace{p(x \mid y = -1)}_{p_{\mathrm{non}}(x)}$$

Confidence: $r(x) \triangleq \Pr(y = +1 \mid x)$

Unlabeled demonstration: $\{x_i\}_{i=1}^{n_u} \sim p$

Demonstration with confidence: $\{(x_j, r_j)\}_{j=1}^{n_c} \sim q$
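To make the data model concrete, here is a toy sketch with made-up one-dimensional "state-action" features; the Gaussian components and $\alpha = 0.7$ are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy instance of the mixture p = alpha * p_opt + (1 - alpha) * p_non:
# optimal pairs ~ N(+1, 1), non-optimal pairs ~ N(-1, 1).
rng = np.random.default_rng(0)
alpha, n_u, n_c = 0.7, 1000, 200

def draw(n):
    """Sample n pairs from the mixture p."""
    y = rng.random(n) < alpha                     # latent optimality label
    return np.where(y, rng.normal(+1.0, 1.0, n),  # p_opt component
                       rng.normal(-1.0, 1.0, n))  # p_non component

x_u = draw(n_u)  # unlabeled demonstration {x_i} ~ p
x_c = draw(n_c)  # pairs to be annotated with confidence

def r(x):
    """Exact confidence r(x) = Pr(y = +1 | x) for this mixture (Bayes' rule)."""
    p_opt = np.exp(-0.5 * (x - 1.0) ** 2)  # unnormalized; constants cancel
    p_non = np.exp(-0.5 * (x + 1.0) ** 2)
    return alpha * p_opt / (alpha * p_opt + (1 - alpha) * p_non)

r_c = r(x_c)  # {(x_j, r_j)} ~ q, the confidence-labeled sample
```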


Page 6: Proposed Method 1: Two-Step Importance Weighting Imitation Learning

Step 1: estimate confidence by learning a confidence scoring function g

Unbiased risk estimator (come to Poster #47 for details):

$$R_{\mathrm{SC},\ell}(g) = \underbrace{\mathbb{E}_{x,r \sim q}[\, r \, \ell(g(x)) \,]}_{\text{Risk for optimal}} + \underbrace{\mathbb{E}_{x,r \sim q}[\, (1 - r) \, \ell(-g(x)) \,]}_{\text{Risk for non-optimal}}$$

Theorem. For $\delta \in (0, 1)$, with probability at least $1 - \delta$ over repeated sampling of data for training $g$,

$$R_{\mathrm{SC},\ell}(g) - R_{\mathrm{SC},\ell}(g^*) = O_p\Big(\underbrace{n_c^{-1/2}}_{\#\text{ of confidence}} + \underbrace{n_u^{-1/2}}_{\#\text{ of unlabeled}}\Big)$$
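An empirical counterpart of the risk estimator above, sketched under the assumption that $\ell$ is the logistic loss; the names `g`, `x_c`, `r_c` are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

# Empirical version of R_{SC,l}(g), assuming the logistic loss
# l(z) = log(1 + exp(-z)). x_c, r_c are the confidence-labeled
# state-action pairs and their confidences (the sample from q).
def sc_risk(g, x_c: torch.Tensor, r_c: torch.Tensor) -> torch.Tensor:
    z = g(x_c).squeeze(-1)
    loss_pos = F.softplus(-z)  # l(g(x))  = log(1 + e^{-g(x)})
    loss_neg = F.softplus(z)   # l(-g(x)) = log(1 + e^{g(x)})
    # r * l(g(x)) + (1 - r) * l(-g(x)), averaged over the sample
    return (r_c * loss_pos + (1.0 - r_c) * loss_neg).mean()
```

With the logistic loss the pointwise minimizer is $g(x) = \log\frac{r(x)}{1 - r(x)}$, so an estimated confidence for Step 2 can be read off as $\hat{r}(x) = \mathrm{sigmoid}(g(x))$.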

Step 2: employ importance weighting to reweight the GAIL objective

Importance weighting:

$$\min_\theta \max_w \; \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x \sim p}\Big[\frac{r(x)}{\alpha} \log(1 - D_w(x))\Big]$$
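This works because $p_{\mathrm{opt}}(x) = r(x)\,p(x)/\alpha$ by Bayes' rule, so weighting unlabeled samples by $r(x)/\alpha$ recovers an expectation under $p_{\mathrm{opt}}$. A hedged sketch of the reweighted discriminator loss, with the same placeholder conventions as the GAIL sketch above:

```python
import torch

# Reweighted discriminator loss: x_agent ~ p_theta from policy rollouts,
# x_u ~ p unlabeled demonstrations, r_hat their estimated confidences
# from Step 1, alpha the proportion of optimal data. D is a
# sigmoid-output network as in the GAIL sketch.
def weighted_d_loss(D, x_agent, x_u, r_hat, alpha):
    eps = 1e-8
    w = r_hat / alpha  # importance weight: E_p[w(x) f(x)] = E_{p_opt}[f(x)]
    return -(torch.log(D(x_agent) + eps).mean()
             + (w * torch.log(1.0 - D(x_u) + eps).squeeze(-1)).mean())
```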


Page 7: Proposed Method 2: GAIL with Imperfect Demonstration and Confidence

Mix the agent demonstration with the non-optimal one

$$p' = \alpha p_\theta + (1 - \alpha) p_{\mathrm{non}}$$

Matching $p'$ with $p$ yields $p_\theta = p_{\mathrm{opt}}$, while benefiting from the large amount of unlabeled data.

Objective:

$$V(\theta, D_w) = \underbrace{\mathbb{E}_{x \sim p}[\log(1 - D_w(x))]}_{\text{Risk for P class}} + \underbrace{\alpha \, \mathbb{E}_{x \sim p_\theta}[\log D_w(x)] + \mathbb{E}_{x,r \sim q}[(1 - r) \log D_w(x)]}_{\text{Risk for N class}}$$
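A minimal sketch of this objective under the same placeholder conventions as above ($D$ sigmoid-valued; `x_u` $\sim p$, `x_agent` $\sim p_\theta$, `(x_c, r_c)` $\sim q$):

```python
import torch

# V(theta, D_w): the "positive" class is the unlabeled mixture p, while
# the "negative" class mixes agent data (weight alpha) with confidence-
# weighted non-optimal demonstrations, so that matching
# p' = alpha * p_theta + (1 - alpha) * p_non to p forces p_theta = p_opt.
def v_objective(D, x_u, x_agent, x_c, r_c, alpha):
    eps = 1e-8
    risk_p = torch.log(1.0 - D(x_u) + eps).mean()
    risk_n = (alpha * torch.log(D(x_agent) + eps).mean()
              + ((1.0 - r_c) * torch.log(D(x_c) + eps).squeeze(-1)).mean())
    return risk_p + risk_n  # discriminator ascends, policy descends
```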


Page 8: Setup

Confidence is given by a classifier trained with the demonstration mixture labeled as optimal ($y = +1$) and non-optimal ($y = -1$)


Page 9: Results: Higher Average Return of the Proposed Methods

Environment: MuJoCo

Proportion of labeled data: 20%


Page 10: Results: Unlabeled Data Helps

More unlabeled data results in lower variance and better performance

The proposed methods are robust to noise

(a) Number of unlabeled data points. The number in the legend indicates the proportion of the original unlabeled data.

(b) Noise influence. The number in the legend indicates the standard deviation of the Gaussian noise.


Page 11: Conclusion

Two approaches that utilize both unlabeled data and confidence-labeled data are proposed

Our methods are robust to noisy labelers

The proposed approaches can be generalized to other imitation learning (IL) and inverse reinforcement learning (IRL) methods

Poster #47


Page 12: References

[1] Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.

[2] Syed, Umar, Michael Bowling, and Robert E. Schapire. "Apprenticeship learning using linear programming." Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.


