NT2Lex - University of Rochester · tetreaul/Presentations-and-Posters/... · 2018. 6. 16. · Swedish...

Date post: 10-Feb-2021
Transcript
  • NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet

    Anaïs Tack 1,2, Thomas François 1, Piet Desmet 2, Cédrick Fairon 1

    1 CENTAL, Université catholique de Louvain, Louvain-la-Neuve, Belgium

    2 ITEC, imec, KU Leuven Kulak, Kortrijk, Belgium

    CEFR-GRADED LEXICONS

    A graded lexicon is a lexical database that includes lexical frequencies observed in texts graded along a difficulty scale.

    Foreign language (L2) materials

    • textbooks and readers / learner texts

    • CEFR scale [A1 > A2 > B1 > B2 > C1 > C2] (Council of Europe, 2001)

    CEFRLex: cental.uclouvain.be/cefrlex/

    ANALYSIS Semantics

    ANALYSIS Frequency

    KEY TAKEAWAYS

    NT2Lex

    • a new resource for Dutch as a foreign language (NT2)

    • 17,743 entries with graded frequency distributions

    • measure of receptive word difficulty

    • measure of word sense complexity through linkage to Open Dutch WordNet

    • cental.uclouvain.be/nt2lex/

    French - FLELex (François et al., 2014)

    Swedish - SVALex (François et al., 2016)

    English - EFLLex (Dürlich & François, 2018)

    Swedish - SweLLex (Volodina et al., 2016)

    ANALYSIS Psycholinguistics

    NT2LEX

    Tools: online tools for lexical complexity analysis

    • database search

    • CEFR-based complex word identification (Tack et al., 2016)

    Corpus of reading materials

    • corpus of 461,088 tokens

    • 5 CEFR levels (A1, A2, B1, B2, C1)

    Preprocessing

    • part-of-speech tagging with Frog (van den Bosch et al., 2007)

    • SVM WSD tool trained on DutchSemCor (Vossen et al., 2012)

    • linkage to Open Dutch WordNet (Postma et al., 2016)

    Lexical frequencies

    • lexical entries with per-level observed frequency

    • normalised for lexical dispersion (Carroll et al., 1971)
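The dispersion index of Carroll et al. (1971) that underlies this normalisation can be sketched as a normalised entropy over the texts of one level. This is a minimal illustration, not the NT2Lex pipeline itself: the function name `carroll_d` and its input format are assumptions, and the full dispersion-adjusted frequency is not reproduced here.

```python
import math
from typing import Sequence


def carroll_d(rel_freqs: Sequence[float]) -> float:
    """Dispersion index D for one word (after Carroll et al., 1971).

    rel_freqs holds the word's relative frequency in each text (or other
    subcorpus) of a single level. D is 0.0 when the word occurs in one
    text only and 1.0 when it is spread perfectly evenly over all texts.
    """
    n = len(rel_freqs)
    total = sum(rel_freqs)
    if n < 2 or total == 0:
        return 0.0
    # Entropy of the word's distribution over texts (base 2).
    weighted_logs = sum(p * math.log2(p) for p in rel_freqs if p > 0)
    entropy = math.log2(total) - weighted_logs / total
    # Divide by the maximum possible entropy so that D lies in [0, 1].
    return entropy / math.log2(n)


if __name__ == "__main__":
    print(carroll_d([0.25, 0.25, 0.25, 0.25]))  # evenly spread word
    print(carroll_d([1.0, 0.0, 0.0, 0.0]))      # word confined to one text
```

A word that appears in every text of a level keeps most of its observed frequency after such a normalisation, while a word concentrated in a single text is discounted.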

    Resource: NT2LEX

    lemma               pos   sense        synset               A1     A2     B1     B2     C1
    pakken (to grab)    WW()  pakken-v-1   odwn-10-101230891-v  35     117    101    5      -
    pakken (to defeat)  WW()  pakken-v-10  eng-30-01100145-v    -      51     12     -      -
    zijn (to exist)     WW()  zijn-v-1     eng-30-02603699-v    2,094  1,647  1,423  1,253  1,335
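The entries above can be read as records with one frequency per CEFR level. The sketch below stores the three rows shown on the poster and derives the first level at which a sense is attested, one simple heuristic for receptive difficulty and for flagging words that may be complex for a learner. The class and function names are hypothetical, and the heuristic is an assumption, not necessarily the measure used by the NT2Lex tools.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LEVELS = ["A1", "A2", "B1", "B2", "C1"]


@dataclass
class Entry:
    lemma: str
    gloss: str
    pos: str
    sense: str
    synset: str
    freqs: Dict[str, float]  # a "-" cell in the table is stored as 0.0


def parse_freq(cell: str) -> float:
    """Convert a table cell such as '2,094' or '-' to a number."""
    return 0.0 if cell == "-" else float(cell.replace(",", ""))


def make_entry(lemma: str, gloss: str, pos: str, sense: str,
               synset: str, cells: List[str]) -> Entry:
    freqs = {level: parse_freq(cell) for level, cell in zip(LEVELS, cells)}
    return Entry(lemma, gloss, pos, sense, synset, freqs)


# The three rows shown in the table above.
ENTRIES = [
    make_entry("pakken", "to grab", "WW()", "pakken-v-1",
               "odwn-10-101230891-v", ["35", "117", "101", "5", "-"]),
    make_entry("pakken", "to defeat", "WW()", "pakken-v-10",
               "eng-30-01100145-v", ["-", "51", "12", "-", "-"]),
    make_entry("zijn", "to exist", "WW()", "zijn-v-1",
               "eng-30-02603699-v",
               ["2,094", "1,647", "1,423", "1,253", "1,335"]),
]


def first_level(entry: Entry) -> Optional[str]:
    """Lowest CEFR level at which the sense has a non-zero frequency."""
    for level in LEVELS:
        if entry.freqs[level] > 0:
            return level
    return None


def is_complex_for(entry: Entry, learner_level: str) -> bool:
    """Assumed heuristic: complex if first attested above the learner's level."""
    first = first_level(entry)
    if first is None:
        return True
    return LEVELS.index(first) > LEVELS.index(learner_level)


if __name__ == "__main__":
    for e in ENTRIES:
        print(e.sense, first_level(e), is_complex_for(e, "A1"))
```

Under this heuristic the two senses of pakken differ: "to grab" is attested from A1, whereas "to defeat" first appears at A2.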

    [Figure: dispersion (y-axis, 0.0 to 1.0) plotted against frequency (x-axis, 0 to 80); r² = 0.83]

    frequency

    • correlation Subtlex-NL (Keuleers et al., 2010)

    • Zipfian effects: shorter = more frequent

    dispersion

    • theoretical familiarity

    • more dispersed = basic vocabulary
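A comparison with an external frequency list such as Subtlex-NL comes down to a correlation over the words that both resources share. The sketch below computes a plain Pearson r and its square in pure Python. The helper names and the toy numbers are invented for illustration and come from neither resource, and the poster does not state whether frequencies were transformed (for example, log-scaled) beforehand.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Share of variance explained by a linear relation between xs and ys."""
    return pearson_r(xs, ys) ** 2


if __name__ == "__main__":
    # Invented toy frequencies for five shared words (not real data).
    toy_graded = [120.0, 45.0, 300.0, 8.0, 60.0]
    toy_reference = [100.0, 50.0, 280.0, 15.0, 40.0]
    print(round(r_squared(toy_graded, toy_reference), 3))
```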

    [Figure: polysemes (y-axis, 0.5 to 4.0) per level (x-axis: A1, A2, B1, B2, C1, TOTAL)]

    semasiology

    • form > meaning mappings

    • easy = more polysemous

    onomasiology

    • meaning > form mappings

    • lower degree of synonymy

    • L2-specific lexicalisations
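The two perspectives above can be sketched as two counts over the sense-to-synset links: senses per lemma (semasiology, form to meaning) and lemmas per synset (onomasiology, meaning to form). The rows below are the three entries shown in the NT2LEX table, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (lemma, sense, synset) triples taken from the sample entries in the table.
SENSES = [
    ("pakken", "pakken-v-1", "odwn-10-101230891-v"),
    ("pakken", "pakken-v-10", "eng-30-01100145-v"),
    ("zijn", "zijn-v-1", "eng-30-02603699-v"),
]

Row = Tuple[str, str, str]


def senses_per_lemma(rows: Iterable[Row]) -> Dict[str, int]:
    """Semasiological view: how many attested senses each form has."""
    by_lemma = defaultdict(set)
    for lemma, sense, _synset in rows:
        by_lemma[lemma].add(sense)
    return {lemma: len(senses) for lemma, senses in by_lemma.items()}


def lemmas_per_synset(rows: Iterable[Row]) -> Dict[str, int]:
    """Onomasiological view: how many forms express each meaning."""
    by_synset = defaultdict(set)
    for lemma, _sense, synset in rows:
        by_synset[synset].add(lemma)
    return {synset: len(lemmas) for synset, lemmas in by_synset.items()}


if __name__ == "__main__":
    print(senses_per_lemma(SENSES))
    print(lemmas_per_synset(SENSES))
```

Restricting the rows to the senses attested at a given level before counting would give per-level figures of the kind plotted on the poster.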

    [Figure: density (y-axis, 0.00 to 0.30) of age of acquisition (x-axis, 0 to 20), with series for A1, A2, B1, B2, C1 and TOTAL]

    [Figure: density (y-axis, 0.00 to 0.40) of concreteness (x-axis, 0 to 6), with series for A1, A2, B1, B2, C1 and TOTAL]

    interplay of psycholinguistic norms (Brysbaert et al., 2014)
