Processing Learner Texts: from Annotation to . . .
Weiwei Sun
Wangxuan Institute of Computer Technology, Peking University
@BCLU 2020
give a topic and then discussion about it
Is it a good English sentence?
Can you guess the meaning?
1 of 60
English as a Second Language (ESL)
L1 speakers: 29%
L2 speakers: 71%
from Ethnologue (2019, 23rd edition)
898.4 million ESL speakers!
2 of 60
Learner texts are everywhere . . .
Language Tests
Social Networks
my paper
https://acl2020.org/blog/general-conference-statistics/
and perhaps yours
3 of 60
First languages, second languages, cross-lingual transfer
L1 has an influence on L2
4 of 60
Something like:
- a native speaker of Japanese learning Chinese: L2-chi, L1-jpn
- a native speaker of English learning Chinese: L2-chi, L1-eng
5 of 60
Universals
Noun Phrase Accessibility Hierarchy (Keenan & Comrie, 1977)
Subject > direct object > indirect object > oblique > genitive > object of comparison
If a language can relativize on a position on the hierarchy, then any other higher position can also be relativized on.
(1) a. the man who I am taller than (object of comparison)
b. the man whose father I know (genitive)
For example, if a language allows (1a), then it allows (1b).
A universal of SLA
L2 learners find relative clauses higher on the hierarchy easier to acquire.
6 of 60
Annotating learner texts
There is naturally a need to automatically annotate second language data with rich lexical, syntactic, semantic and even pragmatic information.
High-performance automatic annotation,
- from an engineering perspective, enables deriving high-quality information by structuring this specific type of data, and
- from a scientific perspective, enables quantitative studies of Second Language Acquisition, complementary to hands-on experience in interpreting second language phenomena.
Is this talk about annotating grammatical errors? Not really.
7 of 60
Data: Reddit (https://www.reddit.com)
Large-scale L2 texts are available!
250M native and non-native English sentences (3.8B tokens), covering over 45K authors from 50 countries (Rabinovich et al., 2018)
L1       Sentence
French   I have to go to the Dr. to do a rapid check on my heart stability.
French   Maybe put every name through a manual approbation pipeline so it ensures quality.
French   Polls have shown public approbation for this law is somewhere between 58% and 65%, and it has been a strong promise during the presidential campaign.
Italian  The event was even more shocking because the precedent evening he wasn't sick at all.
8 of 60
(Automatic) Annotation for learner languages
2010
POS tags (Díaz-Negrillo et al., 2010)
2012 2013 2014 2016 2017
POS tags, syntactic dependencies (Ragheb & Dickinson, 2012; Dickinson & Ragheb, 2013; Ragheb & Dickinson, 2014)
Phrase Structure (Nagata & Sakaguchi, 2016)
Universal Dependencies (Berzak et al., 2016)
UD for learner Chinese (Lee et al., 2017)
Annotated L2 texts are available!
Corpus       L2       L1                  #sent  Structure
TLE          English  Multiple            5,124  Universal Dependencies
Konan-JIEM   English  Japanese            3,260  Phrase Structure
ICNALE       English  10 Asian countries  1,930  Phrase Structure
Chinese-CFL  Chinese  Multiple              451  Universal Dependencies
9 of 60
Data: lang-8 (http://lang-8.com)
Large-scale L2-L1 parallel texts are available!
6.8M English sentence pairs and 720K Chinese sentence pairs (Mizumoto et al., 2011; Y. Zhao et al., 2018)
L2 speaker: 城市里的人能度过多方面的生活。  corrected: 城市里的人能过丰富多彩的生活。 ('people in cities can lead rich and colorful lives')
L2 speaker: You know what should I done.  corrected: You know what I should have done.
10 of 60
Patterns of cross-lingual transfer
Noun compounds:
  L2-English: bread smell (cf. Mandarin 面包 香气 'bread smell'); Standard English: smell of bread
  Tree-string pattern: NP(x0:NN x1:NN) → x1 x0
Adverb placement:
  L2-English: play often sports (cf. French faire souvent du sport); Standard English: often play sports
  Tree-string pattern: VP(x0:VB x1:ADVP x2:NP) → x1 x0 x2
Using patterns
Pipeline: Learner Texts → Tree-String Patterns → Vectors → Phylogenetic Structure
[Recovered tree over L1s: Russian, Polish, Ukrainian, Spanish, French, Portuguese (Brazilian), Italian, Persian, German, Hindi]
Zhao et al. (2020); arXiv:2007.09076
12 of 60
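The patterns-to-vectors step can be sketched as counting each L1 group's tree-string patterns and comparing the resulting count vectors; the patterns and counts below are invented for illustration, and the real pipeline clusters the vectors into a phylogenetic tree.

```python
from collections import Counter
from math import sqrt

def pattern_vector(patterns):
    """Turn a list of extracted tree-string patterns into a sparse count vector."""
    return Counter(patterns)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical pattern counts for three L1 backgrounds (toy numbers, not real data).
french  = pattern_vector(["VP(VB ADVP NP)->ADVP VB NP"] * 7 + ["NP(NN NN)->NN NN"] * 1)
italian = pattern_vector(["VP(VB ADVP NP)->ADVP VB NP"] * 6 + ["NP(NN NN)->NN NN"] * 2)
chinese = pattern_vector(["NP(NN NN)->NN NN"] * 8 + ["VP(VB ADVP NP)->ADVP VB NP"] * 1)

# The Romance L1s end up closer to each other than to Chinese.
assert cosine(french, italian) > cosine(french, chinese)
```

A clustering step (e.g. hierarchical agglomerative clustering over such similarities) would then recover the tree shown on the slide.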
(Automatic) Annotation for learner languages
2010
POS tags (Díaz-Negrillo et al., 2010)
2012 2013 2014 2016 2017 2018 2020
POS tags, syntactic dependencies (Ragheb & Dickinson, 2012; Dickinson & Ragheb, 2013; Ragheb & Dickinson, 2014)
Phrase Structure (Nagata & Sakaguchi, 2016)
Universal Dependencies (Berzak et al., 2016)
UD for learner Chinese (Lee et al., 2017)
Semantic Roles (Lin et al., 2018)
English Resource Semantics (Y. Zhao et al., 2020)
Negation Scope Resolution (under submission)
capture “meanings”
13 of 60
Can humans understand interlanguage robustly?
☹ It is difficult to define the syntactic formalism of learner language.
Grammaticality judgement
☺ But sometimes we can understand what they mean . . .
14 of 60
Research questions
- How can we capture meanings of L2s? How can we annotate L2 texts? Are there many differences from annotating L1 texts?
- How badly does an L1-data-trained semantic parser perform? Can state-of-the-art grammatical error correction systems help?
- What role does syntactic parsing play in processing L2 texts? What role does cross-lingual transfer play?
15 of 60
Shallow and not-that-shallow meaning representations
Semantic Role Labeling
  [Some boys]-ARG0 want [to go]-ARG1 .
Bi-lexical Semantic Dependency
  Some boys want to go .  (edges: BV, ARG1, ARG1, ARG2)
Conceptual Graphs
  [Graph: some_q -BV-> boy_n_1; want_v_to -ARG1-> boy_n_1, -ARG2-> go_v_1; go_v_1 -ARG1-> boy_n_1]
16 of 60
Interface Hypothesis
Language structures involving an interface between different language modules, such as the syntax-semantics interface and the semantics-pragmatics interface, are less likely to be acquired completely than structures that do not involve such an interface; see e.g. Sorace (2011).
[Syntactic tree of the learner sentence: [VP [V give] [NP [D a] [N topic]] [CONJ and then] [NP [N discussion] [PP [P about] [PN it]]]]]
17 of 60
Literal meaning versus intended meaning
Literal Meaning: conventional meaning; sentence meaning; features of the linguistic code.
Intended Meaning: speaker meaning; the author's intention; recovered through interpretation.
The two are often not the same.
[EDS graph of the literal reading, "give a topic and then discussion about it.": give_v_1, pron/pronoun_q, a_q, topic_n_of, and+then_c, udef_q, discussion_n_1, about_p, pron/pronoun_q, connected by BV, ARG1, ARG2, L-INDEX and R-INDEX edges]
[EDS graph of the intended reading, "Give a topic and then discuss it.": give_v_1, pron/pronoun_q, a_q, topic_n_of, and+then_c, discuss_v_1, pron/pronoun_q, connected by BV, ARG1, ARG2, L-INDEX and R-INDEX edges]
18 of 60
SemBanking in Natural Language Processing
compositionally
non-compositionally
manually-annotated
grammar-based
PropBank (Kingsbury & Palmer, 2002); FrameNet (Baker et al., 1998)
Redwoods Treebank (Oepen et al., 2004); TREPIL (Rosén et al., 2005); Groningen Meaning Bank (Basile et al., 2012)
Abstract Meaning Representation (Banarescu et al., 2013)
comprehensiveness, consistency, scalability
Bender, E.M., Flickinger, D., Oepen, S., Packard, W. and Copestake, A. Layers of interpretation: On grammar and compositionality. IWCS 2015.
19 of 60
Two languages, three tasks
Chinese as a Second Language
- Semantic Role Labeling
- Negation Scope Resolution
English as a Second Language
- Compositional Semantics
Semantic Graph Parsing
20 of 60
SemBanking in Natural Language Processing
compositionally
non-compositionally
manually-annotated
grammar-based
PropBank (Kingsbury & Palmer, 2002); FrameNet (Baker et al., 1998)
Redwoods Treebank (Oepen et al., 2004); TREPIL (Rosén et al., 2005); Groningen Meaning Bank (Basile et al., 2012)
Abstract Meaning Representation (Banarescu et al., 2013)
comprehensiveness, consistency, scalability ✓
Bender, E.M., Flickinger, D., Oepen, S., Packard, W. and Copestake, A. Layers of interpretation: On grammar and compositionality. IWCS 2015.
21 of 60
Semantic Role Labeling
- Argument (AN): Who did what to whom?
- Adjunct (AM): When, where, why and how?
[I]-A0 ate [breakfast]-A1 [quickly]-AM-MNR [in the car]-AM-LOC [this morning]-AM-TMP [because I was in a hurry]-AM-PRP
AM-PRP
Our work
Z. Lin, Y. Duan, Y. Zhao, W. Sun and X. Wan. Semantic Role Labeling for Learner Chinese: the Importance of Syntactic Analysis and L2-L1 Parallel Data. EMNLP 2018.
22 of 60
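The PropBank-style analysis above can be held in a simple span-based data structure; the representation here is a hedged sketch for illustration, not the annotation tool's actual format.

```python
# Token list for the slide's example sentence.
sentence = ("I ate breakfast quickly in the car this morning "
            "because I was in a hurry").split()

# (label, start, end) spans over token indices, with end exclusive.
roles = [
    ("A0", 0, 1),        # I
    ("A1", 2, 3),        # breakfast
    ("AM-MNR", 3, 4),    # quickly
    ("AM-LOC", 4, 7),    # in the car
    ("AM-TMP", 7, 9),    # this morning
    ("AM-PRP", 9, 15),   # because I was in a hurry
]

def realize(label):
    """Recover the surface string filling a given role."""
    for lab, i, j in roles:
        if lab == label:
            return " ".join(sentence[i:j])
    return None

assert realize("AM-LOC") == "in the car"
assert realize("A0") == "I"
```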
Data source
Pipeline: initial collection (1,108,907 pairs) → clean-up (717,241 pairs) → manual selection of a reasonably sized subset → annotation
Typologically different mother tongues:
Chinese   Sino-Tibetan
Russian   Slavic
Arabic    Semitic
Japanese  ?
English   Germanic
23 of 60
Inter-annotator agreement
- Annotators: two students majoring in linguistics
- The first 50-sentence trial set: adapting and refining the Chinese PropBank specification
- The remaining 100-sentence set: used to report inter-annotator agreement
[Bar chart (90-100): inter-annotator agreement on L1 and L2 text, by mother tongue: ENG, JPN, RUS, ARA.]
24 of 60
Negation Scope Resolution
- Negation cue: the linguistic unit that expresses negation.
- Negation event: the event related to a cue.
- Negation scope: the maximal part(s) of the sentence that are influenced or negated by the negation cue.
(2) a. We needs actions and not thoughts .
b. He failed to catch the first train .
c. This is an un clean desk .
d. 换言说,没有 宗教生活与日常生活差距 。
e. 换言说, 宗教生活与日常生活之间 没有 距离 。
Our work
M. Zhang, W. Wang, Y. Zhao, S. Sun, W. Sun and X. Wan. Negation Scope Resolution for Chinese as a Second Language. (under submission)
25 of 60
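Cue and scope annotations like (2b) are often encoded as token-level tag sequences; the binary scheme below is illustrative only (corpora differ, e.g., on whether the cue itself belongs to the scope).

```python
# Token-level encoding of cue and scope for example (2b) on the slide.
tokens = "He failed to catch the first train .".split()
cue    = [0, 1, 0, 0, 0, 0, 0, 0]   # "failed" is the negation cue
scope  = [1, 0, 1, 1, 1, 1, 1, 0]   # tokens the cue negates (cue excluded here)

def scope_text(tokens, scope):
    """Recover the (possibly discontinuous) scope as a string."""
    return " ".join(t for t, s in zip(tokens, scope) if s)

assert scope_text(tokens, scope) == "He to catch the first train"
```

Scope resolution then becomes a sequence-labeling problem: predict the `scope` vector given the sentence and the cue.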
Inter-annotator agreement
                     C. F1   S. F1   T. F1   C. Kappa  S. Kappa
CD-SCO (2012)        94.88   85.04   91.53      -         -
SFU Review (2012)    92.79   81.88     -      92.70     87.20
BioScope (2008)      98.65   95.91     -        -         -
CNeSp (2015)           -       -       -      95.00     93.00
L2-chi, L1-eng      100.00   92.55   97.12   100.00     96.13
chiL2⇒L1, L1-eng    100.00   92.55   97.71   100.00     95.65
L2-chi, L1-jpn      100.00   90.09   94.62   100.00     92.15
chiL2⇒L1, L1-jpn    100.00   90.09   94.35   100.00     92.32
- Inter-annotator agreement w.r.t. negation cue and scope.
- Previous negation corpora reported cue-level F1 (C. F1) at 91%-95%, scope-level F1 (S. F1) at 76%-85%, token-level F1 (T. F1) at 88%-92%, and kappa at 87%-91%.
26 of 60
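The kappa columns in the table are Cohen's kappa over the two annotators' decisions, which can be computed as follows (the label sequences are toy data, not the corpus).

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement under independent annotators with these label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy token-level in-scope (I) / out-of-scope (O) decisions from two annotators.
ann1 = ["I", "I", "O", "I", "O", "O", "I", "O"]
ann2 = ["I", "I", "O", "I", "O", "I", "I", "O"]
assert round(cohen_kappa(ann1, ann2), 2) == 0.75
```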
SemBanking in Natural Language Processing
compositionally
non-compositionally
manually-annotated
grammar-based
PropBank (Kingsbury & Palmer, 2002); FrameNet (Baker et al., 1998)
Redwoods Treebank (Oepen et al., 2004); TREPIL (Rosén et al., 2005); Groningen Meaning Bank (Basile et al., 2012)
Abstract Meaning Representation (Banarescu et al., 2013)
comprehensiveness, consistency, scalability ✓
Bender, E.M., Flickinger, D., Oepen, S., Packard, W. and Copestake, A. Layers of interpretation: On grammar and compositionality. IWCS 2015.
27 of 60
Treebank of Learner English (Berzak et al., 2016)
TLE (http://esltreebank.org/)
- a collection of 5,124 ESL sentences
- manually annotated with POS tags and UD trees
- in original and error-corrected forms.
28 of 60
SemBanking by integrating TLE and ERG
Input sentence → ACE/PET parser (Packard, 2013) → Meaning Representations 1, 2, ..., K → Reranker → Meaning Representation
English Resource Grammar (ERG; Flickinger, 1999): a hand-crafted computational grammar; HPSG-based; 25+ person-years of work.
The reranker is informed by manually annotated Universal Dependencies (UD).
Given K candidate graphs G_1, G_2, ..., G_K and the gold UD tree T:
  G = argmax_{1 <= i <= K} score(G_i, T)
  score(G_i, T) = W^T F(f_{G_i}, f_T)
Target representation: Elementary Dependency Structures (EDS; Oepen and Lønning, 2006), among several others.
29 of 60
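The reranking rule above can be sketched with a toy feature function that counts dependency overlap between a candidate semantic graph and the gold UD tree; the real feature map F and the learned weight vector W are much richer than this stand-in.

```python
# Toy reranker: pick the candidate graph whose head-dependent pairs best
# overlap the gold UD tree. A single scalar weight w plays the role of W.
def score(graph_edges, ud_edges, w=1.0):
    return w * len(set(graph_edges) & set(ud_edges))

# Hypothetical gold UD edges for "Give a topic and then discuss it."
gold_ud = {("discuss", "topic"), ("discuss", "it")}

candidates = [
    {("discussion", "about"), ("about", "it")},   # literal reading of the L2 string
    {("discuss", "topic"), ("discuss", "it")},    # intended reading
]

best = max(candidates, key=lambda g: score(g, gold_ud))
assert best == {("discuss", "topic"), ("discuss", "it")}
```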
Performance of reranking
[Bar chart (85-100), smatch and EDM: no reranking, reranking over 50 and 500 candidates, and oracle scores over 50 and 500 candidates; the inter-annotator agreement level for EDM (IAA-EDM) is marked for reference.]
Evaluation on DeepBank
Inter-Annotator Agreement (IAA) of EDM is reported in Bender et al. (2015).
30 of 60
Research questions
- How can we capture meanings of L2s? How can we annotate L2 texts? Are there many differences from annotating L1 texts?
- How badly does an L1-data-trained semantic parser perform? Can state-of-the-art grammatical error correction systems help?
- What role does syntactic parsing play in processing L2 texts? What role does cross-lingual transfer play?
31 of 60
Two languages, three tasks
Chinese as a Second Language
- Semantic Role Labeling
- Negation Scope Resolution
English as a Second Language
- Compositional Semantics
Semantic Graph Parsing
[EDS graph: some_q -BV-> boy_n_1; want_v_to -ARG1-> boy_n_1, -ARG2-> go_v_1; go_v_1 -ARG1-> boy_n_1]
31 of 60
Evaluation – smatch (Cai & Knight, 2013)
Gold graph: a:want, b:boy, c:go with triples 1: (a, b, ARG0), 2: (a, c, ARG1), 3: (c, b, ARG0)
Predicted graph: a':want, b':boy, c':go with triples 1: (a', b', ARG0), 2: (a', c', ARG1)
Under the variable mapping v_{xx'} = 1 for x ∈ {a, b, c}, the matched triples are t_11 = 1 and t_22 = 1.
32 of 60
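Once a variable mapping is fixed, smatch reduces to an F1 over matched triples; finding the best mapping is the hard part, which smatch approximates by hill-climbing with random restarts. A sketch of the scoring step only, on the slide's example:

```python
def triple_f1(gold, pred, mapping):
    """F1 over relation triples once a gold-to-predicted variable mapping is fixed."""
    mapped = {(mapping[h], mapping[d], r) for h, d, r in gold}
    matched = len(mapped & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# The example from the slide (a1 stands for a', etc.).
gold = {("a", "b", "ARG0"), ("a", "c", "ARG1"), ("c", "b", "ARG0")}
pred = {("a1", "b1", "ARG0"), ("a1", "c1", "ARG1")}
mapping = {"a": "a1", "b": "b1", "c": "c1"}

# 2 matched triples: precision = 2/2, recall = 2/3, F1 = 0.8.
assert round(triple_f1(gold, pred, mapping), 2) == 0.8
```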
Error-oriented smatch
[EDS graph of the literal reading of "give a topic and then discussion about it.": give_v_1, pron/pronoun_q, a_q, topic_n_of, and+then_c, udef_q, discussion_n_1, about_p]
- Most of the structure is good.
- We need to focus on errors!
→ Add weights
33 of 60
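A sketch of the weighting idea: triples touching nodes aligned to grammatical errors get a higher weight, so mistakes on the erroneous region of the sentence cost more. The weights, triples, and error-node set below are illustrative, not the paper's exact scheme.

```python
def weighted_f1(gold, pred, error_nodes, w_err=3.0, w_ok=1.0):
    """Triple F1 where triples touching error-aligned nodes are up-weighted."""
    weight = lambda t: w_err if t[0] in error_nodes or t[1] in error_nodes else w_ok
    matched = gold & pred
    if not matched:
        return 0.0
    precision = sum(map(weight, matched)) / sum(map(weight, pred))
    recall = sum(map(weight, matched)) / sum(map(weight, gold))
    return 2 * precision * recall / (precision + recall)

gold = {("give_v_1", "topic_n_of", "ARG2"), ("discuss_v_1", "pron", "ARG2")}
pred = {("give_v_1", "topic_n_of", "ARG2"), ("discussion_n_1", "about_p", "ARG1")}
errors = {"discuss_v_1", "discussion_n_1"}

# The parser got the easy triple right and the erroneous region wrong,
# so the error-oriented score (0.25) is lower than the uniform one (0.5).
assert weighted_f1(gold, pred, errors) == 0.25
assert weighted_f1(gold, pred, errors, w_err=1.0) == 0.5
```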
Node-relaxed smatch
[Predicted graph fragment: udef_q, discussion_n_1, about_p, pron, pronoun_q with edges BV, ARG1, ARG2, BV.
 Gold graph fragment: discuss_v_1, pron, pronoun_q with edges ARG2, BV.
 Can discussion_n_1 be matched with discuss_v_1?]
L2: We went this discuss .    L1: We went to this discussion .
Statistical Machine Translation (SMT) over large-scale L2-L1 parallel data (lang-8) yields a paraphrase table, e.g. discuss ↔ discussion.
34 of 60
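Node-relaxed matching can be sketched as node F1 in which the paraphrase table licenses extra matches; the tiny table and lemma inventory below are illustrative.

```python
# A toy paraphrase table, standing in for one mined from lang-8 with SMT alignment.
paraphrase_table = {("discuss", "discussion")}

def node_eq(a, b):
    """Two concept nodes match if identical or linked by the paraphrase table."""
    return a == b or (a, b) in paraphrase_table or (b, a) in paraphrase_table

def relaxed_node_f1(gold_nodes, pred_nodes):
    matched = sum(any(node_eq(g, p) for p in pred_nodes) for g in gold_nodes)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_nodes)
    recall = matched / len(gold_nodes)
    return 2 * precision * recall / (precision + recall)

gold = ["discuss", "pron"]
pred = ["discussion", "pron", "about"]
# discuss~discussion and pron match: precision = 2/3, recall = 2/2, F1 = 0.8.
assert round(relaxed_node_f1(gold, pred), 2) == 0.8
```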
Semantic parsers (Koller et al., 2019)
Parser families: factorization-based, composition-based, transition-based, translation-based
[Chart: parsing accuracy (0.5-0.8) over 2014-2019 for each family.]
Modeling syntactico-semantic derivation/composition vs. derived/composed structures
35 of 60
Compositionality
The meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined.
B. Partee
brown + cow = brown cow
Fodor and Lepore (2002)
36 of 60
Hyperedge Replacement Grammar-based parser
[HRG derivation for "and then discussion about it": the graph for and+then_c (edges CONJ, ARG1, R-INDEX) contains an NP-labeled hyperedge, which is rewritten by the subgraph for "discussion about it" (discussion_n_1, about_p, udef_q, pronoun_q, pron; edges BV, ARG1, ARG2, BV).]
37 of 60
Factorization-based parser
- Inspired by graph-based dependency parsers
- Explicitly models the target structure
[Diagram: encoders over "topic and then discussion" yield token representations; the concepts topic_n_of, and_c, then_a_1, discussion_n_1 are predicted by argmax; a biaffine function over the concept representations c_4 and c_6 computes ScoreEdge(and_c → discussion_n_1).]
Systems
Y. Chen, Y. Ye and W. Sun. 2019. Peking at MRP 2019: Factorization- and Composition-Based Parsing for Elementary Dependency Structures. CoNLL Shared Task on Cross-Framework Meaning Representation Parsing.
38 of 60
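The biaffine edge score in the diagram is a bilinear form over two token representations; a dependency-free sketch with toy vectors (real systems learn the matrix U, the bias, and the encodings jointly).

```python
def biaffine(r_head, r_dep, U, b=0.0):
    """score(head -> dep) = r_head^T U r_dep + b, with plain lists as vectors."""
    Ur = [sum(U[i][j] * r_dep[j] for j in range(len(r_dep)))
          for i in range(len(r_head))]
    return sum(h * u for h, u in zip(r_head, Ur)) + b

# Toy 2-d representations; with U = identity the score is just a dot product.
U = [[1.0, 0.0], [0.0, 1.0]]
and_c = [1.0, 2.0]          # stand-in encoding for and_c
discussion = [2.0, 1.0]     # stand-in encoding for discussion_n_1
assert biaffine(and_c, discussion, U) == 4.0
```

At decoding time the parser keeps the edges whose scores pass a threshold (or solves a constrained search over them) to assemble the graph.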
Results — Parsing to literal meanings
[Bar chart (80-100): composition-based vs. factorization-based parsers evaluated on DeepBank, L1, and L2 data.]
Model                Error-oriented  Node   Edge   All
Composition-Based    NO              89.04  82.14  85.75
                     YES             71.46  79.77  76.66
Factorization-Based  NO              90.96  84.48  87.86
                     YES             73.55  80.27  77.75
39 of 60
Results — Parsing to literal meanings
[Bar chart (40-80): parser scores by grammatical error type (WOadv, WOinc, Others, Wform, Pform, Nn, ArtOrDet), composition-based vs. factorization-based.]
- ArtOrDet: It is obvious to see that (internet → the internet) saves people time and also connects people globally.
- WOinc: (Someone having what kind of disease → What kind of disease someone has) is a matter of their own privacy.
40 of 60
Grammatical Error Correction (GEC)
Original Sentence
give a topic and thendiscussion about it .
Figure 1: Architecture of our multilayer convolutional modelwith seven encoder and seven decoder layers (only one en-coder and one decoder layer are illustrated in detail).
si ∈ Rd is given by si = w(si) + p(i), where w(si) is theword embedding and p(i) is the position embedding cor-responding to the position i of token si in the source sen-tence. Both embeddings are obtained from embedding ma-trices that are trained along with other parameters of the net-work.
The encoder and decoder are made up of L layers each.The architecture of the network is shown in Figure 1. Thesource token embeddings, s1, . . . , sm, are linearly mappedto get input vectors of the first encoder layer, h0
1, . . . ,h0m,
where h0i ∈ Rh and h is the input and output dimension of
all encoder and decoder layers. Linear mapping is done bymultiplying a vector with weights W ∈ Rh×d and addingthe biases b ∈ Rh:
h0i = Wsi + b
In the first encoder layer, 2h convolutional filters of dimen-sion 3 × h map every sequence of three consecutive inputvectors to a feature vector f1i ∈ R2h. Paddings (denoted by<pad> in Figure 1) are added at the beginning and end ofthe source sentence to retain the same number of output vec-tors as the source tokens after the convolution operations.
f1i = Conv(h0i−1,h
0i ,h
0i+1)
where Conv(·) represents the convolution operation. This isfollowed by a non-linearity using gated linear units (GLU)(Dauphin et al. 2016):
GLU(f1i ) = f1i,1:h ◦ σ(f1i,h+1:2h)
where GLU(f1i ) ∈ Rh, ◦ and σ represent element-wise mul-tiplication and sigmoid activation functions, respectively,and f1i,u:v denotes the elements of f1i from indices u to v(both inclusive). The input vectors to an encoder layer arefinally added as residual connections. The output vectors ofthe lth encoder layer are given by,
hli = GLU(f li ) + hl−1
i i = 1, . . . ,m
Each output vector of the final encoder layer, hLi ∈ Rh, is
linearly mapped to get the encoder output vector, ei ∈ Rd,using weights We ∈ Rd×h and biases be ∈ Rd:
ei = WehLi + be i = 1, . . . ,m
Now, consider the generation of the target word tn atthe nth time step in decoding, with n − 1 target wordspreviously generated. For the decoder, paddings are addedat the beginning. The two paddings, beginning-of-sentencemarker and the previously generated tokens, are embeddedas t−2, t−1, t0, t1, . . . , tn−1 in the same way as source to-ken embeddings are computed. Each embedding tj ∈ Rd islinearly mapped to g0
j ∈ Rh and passed as input to the firstdecoder layer. In each decoder layer, convolution operationsfollowed by non-linearities are performed on the previousdecoder layer’s output vectors gl−1
j , where j = 1, . . . , n:
ylj = GLU(Conv(gl−1
j−3,gl−1j−2,g
l−1j−1)
where Conv(·) and GLU(·) represent convolutions and non-linearities respectively, and yl
j becomes the decoder state atthe jth time step in the lth decoder layer. The number and sizeof convolution filters are the same as those in the encoder.
Each decoder layer has its own attention module. To com-pute attention at layer l before predicting the target tokenat the nth time step, the decoder state yl
n ∈ Rh is lin-early mapped to a d-dimensional vector with weights Wz ∈Rd×h and biases bz ∈ Rd, adding the previous target to-ken’s embedding:
zln = Wzyln + bz + tn−1
The attention weights αln,i are computed by a dot product of
the encoder output vectors e1, . . . , em with zln and normal-ized by a softmax:
αln,i =
exp(e�i zln)∑m
k=1 exp(e�k zln)i = 1, . . . ,m
The source context vector xln is computed by applying the
attention weights to the summation of the encoder outputvectors and the source embeddings. The addition of thesource embeddings helps to better retain information aboutthe source tokens.
xln =
m∑
i=1
αln,i(ei + si)
The context vector xln is then linearly mapped to cln ∈ Rh.
The output vector of the lth decoder layer, gln, is the summa-tion of cln, yl
n, and the previous layer’s output vector gl−1n .
gln = yl
n + cln + gl−1n
The final decoder layer output vector gLn is linearly mapped
to dn ∈ Rd. Dropout (Srivastava et al. 2014) is applied atthe decoder outputs, embeddings, and before every encoderand decoder layer. The decoder output vector is then mappedto the target vocabulary size (|Vt|) and softmax is computedto obtain target word probabilities.
on = Wodn + bo Wo ∈ R|Vt|×d,bo ∈ R|Vt|
5757
Chollampatt and Ng (2018)158
Copy Scores Vocabulary Distribution
Final Distribution
!1 !3!2 !4 "1 "3"2 "4 "5
#1 #3#2 #4
Encoder Decoder
Attention Distribution
h1src h2src h3src h4src h1trg h2
trg h3trg h4
trg h5trg
+α tcopy
×N×N
Token-level labeling output
Figure 1: Copy-Augmented Architecture.
and generating is controlled by a balancing factorαcopyt ∈ [0, 1] at each time step t.
pt(w) = (1−αcopyt )∗pgent (w)+(αcopy
t )∗pcopyt (w)(5)
The new architecture outputs the generationprobability distribution as the base model, by gen-erating the target hidden state. The copying scoreover the source input tokens is calculated with anew attention distribution between the decoder’scurrent hidden state htrg and the encoder’s hiddenstates Hsrc (same as hsrc1...N ). The copy attention iscalculated the same as the encoder-decoder atten-tions, listed in Equation 6, 7, 8 :
qt,K, V = htrgt W Tq , H
srcW Tk , H
srcW Tv (6)
At = qTt K (7)
P copyt (w) = softmax(At) (8)
The qt, K and V are the query, key, and valuethat needed to calculate the attention distributionand the copy hidden state. We use the normalizedattention distribution as the copy scores and usethe copy hidden states to estimate the balancingfactor αcopy
t .
αcopyt = sigmoid(W T
∑(AT
t · V )) (9)
The loss function is as described in Equation 4,but with respect to our mixed probability distribu-tion yt given in Equation 5.
3 Pre-training
Pre-training is shown to be useful in many tasks when lacking vast amounts of training data. In this section, we propose denoising auto-encoders, which enable pre-training our models with a large-scale unlabeled corpus. We also introduce a partial pre-training method to make a comparison with the denoising auto-encoder.

3.1 Denoising Auto-encoder
Denoising auto-encoders (Vincent et al., 2008) are commonly used for model initialization to extract and select features from inputs. BERT (Devlin et al., 2018) used a pre-trained bi-directional transformer model and outperformed existing systems by a wide margin on many NLP tasks. In contrast to denoising auto-encoders, BERT only predicts the 15% masked words rather than reconstructing the entire input. BERT denoises 15% of the tokens at random by replacing 80% of them with [MASK], 10% of them with a random word, and leaving 10% of them unchanged.

Inspired by BERT and denoising auto-encoders, we pre-train our copy-augmented sequence-to-sequence model by noising the One Billion Word Benchmark (Chelba et al., 2013), which is a large sentence-level English corpus. In our experiments, the corrupted sentence pairs are generated by the
Zhao et al. (2019)
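The excerpt breaks off before describing Zhao et al.'s exact noising scheme; as a stand-in, here is a sketch of the BERT-style 80%/10%/10% corruption the excerpt itself describes, producing (corrupted, original) pairs for denoising pre-training. The function name and rates are illustrative assumptions, not the paper's implementation.

```python
import random

def corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style denoising: sample mask_rate of the positions; replace
    80% of the sampled tokens with [MASK], 10% with a random vocabulary
    word, and leave 10% unchanged. Returns a (corrupted, original)
    sentence pair for a denoising auto-encoder."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the token unchanged (still a prediction target)
    return corrupted, list(tokens)

noisy, clean = corrupt("give a topic and then discuss it".split(),
                       vocab=["the", "a", "of"], mask_rate=0.5)
```

The corrupted sentence keeps the original length; the model is trained to reconstruct the clean side from the noisy side.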
manually annotated
Give a topic and then discuss it .
give a topic and then discuss it .
give a topic and then discuss it .
Semantic Parser
41 of 60
Results — Parsing to intended meanings
[Bar chart: smatch scores (70–90) on the original sentences, on the outputs of Chollampatt and Ng (2018) and Zhao et al. (2019), and on manually corrected sentences, under standard, error-oriented, and node-relaxed smatch.]

Model                       F0.5
Chollampatt and Ng (2018)   45.36
W. Zhao et al. (2019)       61.15
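The F0.5 scores in the table are the conventional GEC metric, which weights precision more heavily than recall. A minimal sketch of the formula (the example inputs are illustrative, not the systems' actual precision/recall):

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; GEC is conventionally reported with beta = 0.5,
    so precision counts twice as much as recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = f_beta(0.6, 0.3)   # example values only
```

With beta = 0.5, a system with precision 0.6 and recall 0.3 scores 0.5, the same as one with 0.5/0.5: precision-heavy systems are rewarded.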
42 of 60
Research questions
I How can we capture meanings of L2s? How can we annotate L2 texts? Are there many differences from annotating L1 texts?
I How badly does an L1-data-trained semantic parser perform? Can state-of-the-art grammatical error correction systems help?
I What role does syntactic parsing play in processing L2 texts? What role does cross-lingual transfer play?
43 of 60
Two languages, three tasks
Chinese as a Second Language
I Semantic Role Labeling
I Negation Scope Resolution
English as a Second Language
I Compositional Semantics
Semantic Graph Parsing
用 汉语 也 说话 快 对 我 来 说 很 难
(intended: 'It is very hard for me to speak Chinese quickly.')
[Semantic graph over the sentence with arcs labeled ARG0, AM, AM.]
43 of 60
Three SRL systems
I PCFGLA-parser-based SRL system (Berkeley parser)
I Neural-parser-based SRL system (minimal span-based parser)
I Neural syntax-agnostic SRL system
The parsers are trained on the Chinese TreeBank, whose sentences have SRL annotation in CPB; the SRL systems are trained on the Chinese PropBank (CPB).
44 of 60
Syntax-agnostic SRL
B-A0 I-A0 I-A0 I-A0 I-A0 B-AM I-AM I-AM I-AM B-AM REL
用 汉语 也 说话 快 对 我 来 说 很 难
[Architecture: each word is embedded and fed to a BiLSTM; an MLP on each BiLSTM state predicts the word's BIO role label, e.g. 用/B-A0 汉语/I-A0 也/I-A0 说话/I-A0.]
45 of 60
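The BIO labels predicted above still have to be decoded into argument spans. A minimal sketch of that decoding step (the function name and span convention are ours, not the system's):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into labeled argument spans
    [(label, start, end_exclusive), ...]; 'REL' marks the predicate
    as a single-token span."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel flushes last span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:                              # close any open span
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        if tag == "REL":
            spans.append(("REL", i, i + 1))
    return spans

tags = ["B-A0", "I-A0", "I-A0", "I-A0", "I-A0",
        "B-AM", "I-AM", "I-AM", "I-AM", "B-AM", "REL"]
spans = bio_to_spans(tags)
```

For the example tag sequence this yields an A0 span over tokens 0–4, two AM spans, and the predicate.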
Syntax-based SRL
[Parse tree of 用汉语也说话快对我来说很难啊。with constituents IP 用汉语 'using Chinese', ADVP 也 'also', VP 说话快 'speaking quickly', PP 对我来说 'for me', ADVP 很 'very', ADJ 难 'hard', SP 啊 (mod), PU 。; candidate constituents receive the role labels REL, A0, AM, NULL, AM, AM.]
46 of 60
Evaluation and findings
[Bar chart: SRL performance (65–80) on L1 and L2 data, broken down by the learners' mother tongue (ENG, JPN, RUS, ARA) and overall (ALL), for the PCFGLA-parser-based, neural-parser-based, and neural syntax-agnostic systems.]
The syntax-based systems are more robust when handling learner texts.
47 of 60
Why is syntactic analysis important?
用 汉语 也 说话 快 对我来说 很 难 啊。
Using Chinese also speaking quickly to me very hard.
It is very hard for me to speak Chinese quickly.
[Diagram comparing the gold role spans (A0, rel) with the spans predicted by the syntax-based system and the neural end-to-end system (A0, AM, AM, AM, rel).]
48 of 60
Why is syntactic analysis important?
[Parse tree of the same sentence: IP 用汉语 'using Chinese', ADVP 也 'also', VP 说话快 'speaking quickly', PP 对我来说 'for me', ADVP 很 'very', ADJ 难 'hard', SP 啊 (mod), PU 。]
Though the whole structure is bad, some parts may be good.
49 of 60
Research questions
I How can we capture meanings of L2s? How can we annotate L2 texts? Are there many differences from annotating L1 texts?
I How badly does an L1-data-trained semantic parser perform? Can state-of-the-art grammatical error correction systems help?
I What role does syntactic parsing play in processing L2 texts? What role does cross-lingual transfer play?
50 of 60
Something like
Japanese native speakers learning Chinese → L2-chi, L1-jpn
English native speakers learning Chinese → L2-chi, L1-eng
50 of 60
Two languages, three tasks
Chinese as a Second Language
I Semantic Role Labeling
I Negation Scope Resolution
English as a Second Language
I Compositional Semantics
Semantic Graph Parsing
(3) a. We needs actions and not thoughts .
b. He failed to catch the first train .
c. This is an un clean desk .
d. 换言说,没有 宗教生活与日常生活差距 。('In other words, there is no gap between religious life and daily life.')
e. 换言说, 宗教生活与日常生活之间 没有 距离 。('In other words, there is no distance between religious life and daily life.')
51 of 60
L2-Japanese data is more useful
[Bar charts: recall of models with different training sets (ALL; L2-chi, L1-eng/jpn; L2-chi, L1-eng) on L2 and L1 test data for the ENG and JPN groups.]

Train \ Test         L2-chi, L1-eng   chiL2⇒L1, L1-eng   L2-chi, L1-jpn   chiL2⇒L1, L1-jpn
L2-chi, L1-eng       73.4/67.8/61.3   73.7/69.4/62.9     71.6/64.9/56.3   71.1/64.5/56.9
chiL2⇒L1, L1-eng     73.4/68.1/61.6   73.8/69.7/62.4     70.7/65.1/58.2   71.0/65.4/57.6
L2-chi, L1-jpn       73.2/65.6/57.3   74.7/68.0/60.7     75.2/68.5/62.8   74.2/68.4/61.5
chiL2⇒L1, L1-jpn     72.9/65.1/58.0   74.0/68.3/60.8     74.6/68.4/59.8   74.1/68.5/60.7

Table: Recall scores of BERT-BiLSTM / ELMo-BiLSTM / Random-BiLSTM.
52 of 60
Some examples
(4) a. 所以大多数的人觉得 受 不 了日本的夏天 。('So most people feel they cannot stand the Japanese summer.')
b. 我还 没有 上小学 的时候。('When I had not yet started primary school.')
c. 我们 不 应该恐怕说错 还有不好意思的事 。('We should not be afraid of saying something wrong or embarrassing.')
I pro-drop
I relative clause
I coordination
53 of 60
Conclusion and future work
I How can we capture meanings of L2s? How can we annotate L2 texts? Are there many differences from annotating L1 texts?
I How badly does an L1-data-trained semantic parser perform? Can state-of-the-art grammatical error correction systems help?
I What role does syntactic parsing play in processing L2 texts? What role does cross-lingual transfer play?
I How can we effectively enlarge annotated corpora?
I What is the best practice to annotate syntactic structures of second languages?
I What types of computational analyses can we develop for second language acquisition?
I Can we assess learners' language capability by annotating their language outputs?
54 of 60
Game over
THANK YOU
Joint work with Yuanyuan Zhao, Mengyu Zhang, Weiqi Wang, Zi Lin, Yuguang Duan
55 of 60
References I
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1 (pp. 86–90).
Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., . . . Schneider, N. (2013). Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (pp. 178–186).
Basile, V., Bos, J., Evang, K., & Venhuizen, N. (2012). Developing a large semantically annotated corpus. In Eighth International Conference on Language Resources and Evaluation (pp. 3196–3200).
Bender, E. M., Flickinger, D., Oepen, S., Packard, W., & Copestake, A. (2015). Layers of interpretation: On grammar and compositionality. In Proceedings of the 11th International Conference on Computational Semantics (pp. 239–249).
54 of 60
References II
Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., . . . Katz, B. (2016, August). Universal Dependencies for learner English. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 737–746). Berlin, Germany: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P16-1070
Cai, S., & Knight, K. (2013). Smatch: an evaluation metric for semantic feature structures. In ACL 2013 (Vol. 2, pp. 748–752).
Chollampatt, S., & Ng, H. T. (2018, February). A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Díaz-Negrillo, A., Meurers, D., Valera, S., & Wunsch, H. (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. In Language Forum (Vol. 36, pp. 139–154).
55 of 60
References III
Dickinson, M., & Ragheb, M. (2013). Annotation for learner English guidelines.
Flickinger, D. (1999). The English Resource Grammar.
Fodor, J. A., & Lepore, E. (2002). The compositionality papers. Oxford University Press.
Kingsbury, P., & Palmer, M. (2002). From TreeBank to PropBank. In LREC (pp. 1989–1993).
Koller, A., Oepen, S., & Sun, W. (2019, July). Graph-based meaning representations: Design and processing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts (pp. 6–11). Florence, Italy: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P19-4002
Lee, J., Leung, H., & Li, K. (2017). Towards Universal Dependencies for learner Chinese. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) (pp. 67–71).
56 of 60
References IV
Lin, Z., Duan, Y., Zhao, Y., Sun, W., & Wan, X. (2018). Semantic role labeling for learner Chinese: the importance of syntactic parsing and L2-L1 parallel data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3793–3802).
Mizumoto, T., Komachi, M., Nagata, M., & Matsumoto, Y. (2011). Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing (pp. 147–155).
Nagata, R., & Sakaguchi, K. (2016). Phrase structure annotation and parsing for learner English. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1837–1847).
Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2004). LinGO Redwoods. Research on Language and Computation, 2(4), 575–596.
57 of 60
References V
Oepen, S., & Lønning, J. T. (2006). Discriminant-based MRS banking. In LREC (pp. 1250–1255).
Packard, W. (2013). ACE: the Answer Constraint Engine. URL http://sweaglesw.org/linguistics/ace.
Rabinovich, E., Tsvetkov, Y., & Wintner, S. (2018). Native language cognate effects on second language lexical choice. Transactions of the Association for Computational Linguistics, 6, 329–342.
Ragheb, M., & Dickinson, M. (2012, December). Defining syntax for learner language annotation. In Proceedings of COLING 2012: Posters (pp. 965–974). Mumbai, India. Retrieved from http://cl.indiana.edu/~md7/papers/ragheb-dickinson12.html
58 of 60
References VI
Ragheb, M., & Dickinson, M. (2014). Developing a corpus of syntactically-annotated learner language for English. In Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories (TLT13) (pp. 292–300). Tübingen, Germany. Retrieved from http://cl.indiana.edu/~md7/papers/ragheb-dickinson14b.html
Rosen, V., Meurer, P., De Smedt, K., Butt, M., & King, T. H. (2005). Constructing a parsed corpus with a large LFG grammar. Proceedings of LFG'05, 371–387.
Sorace, A. (2011). Pinning down the concept of "interface" in bilingualism. Linguistic Approaches to Bilingualism, 1(1), 1–33.
Zhao, W., Wang, L., Shen, K., Jia, R., & Liu, J. (2019). Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data.
59 of 60
References VII
Zhao, Y., Jiang, N., Sun, W., & Wan, X. (2018). Overview of the NLPCC 2018 shared task: Grammatical error correction. In CCF International Conference on Natural Language Processing and Chinese Computing (pp. 439–445).
Zhao, Y., Sun, W., Cao, J., & Wan, X. (2020). Semantic parsing for English as a second language. In ACL.
60 of 60