
CoRoLa-based Word Embeddings

Dan Tufiș, Vasile Păiș

ICIA-MD, Romanian Academy

26.09.2018

DRuKoLA Workshop - Bucharest September 26-28


Word Embeddings

– Dense representations of words in a low-dimensional vector space; a research topic in the area of distributional semantics (“a word is characterized by the company it keeps”)

– Vector representations of the words in a large collection of texts (a corpus), which make it possible to outline the relationships among these words and, presumably, to learn their meanings automatically; a more compact (partial) representation of a corpus

– Word embedding models offer “a spatial analogy to relationships between words” (Ben Schmidt’s blog, October 25, 2015)


– The technique of representing words as vectors is quite old (the vector space model was developed in the 1960s for IR), but currently it is mainly implemented by means of neural network (NN) architectures.

– They may be generated by unsupervised ML algorithms; the most famous are word2vec, fastText and GloVe (Gensim is a widely used library that implements word2vec and fastText).

– The vectors (one for each word) have remarkable linear relationships, popularized by Mikolov’s now-famous equation:

vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)


vec(“king”) - vec(“man”) ≈ vec(“queen”) - vec(“woman”)
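The vector-offset relation can be tried out on a toy example. The sketch below uses hand-made 2-D vectors (invented for illustration, not taken from any trained model) and a hypothetical helper `analogy` that returns the word closest, by cosine similarity, to vec(b) - vec(a) + vec(c).

```python
import numpy as np

# Toy 2-D vectors, invented for illustration only.
# Dimension 0 loosely stands for "royalty", dimension 1 for "gender".
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The three query words are excluded from the candidates, as is usual
    in analogy tests.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman", vectors)
print(answer)
```

With real embeddings the equality only holds approximately, which is why the answer is taken to be the nearest neighbour of the computed point rather than an exact match.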


word2vec

• Word2Vec is an efficient predictive model for learning word embeddings from raw text.

• It includes two symmetrical algorithms: the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.

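• Algorithmically, these models are similar, except that CBOW predicts the target word from its context words, while Skip-Gram predicts the context words from the target word.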
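The difference between the two models shows up in the training examples they consume. The following sketch only builds those examples for a tokenized sentence; the function names `skipgram_pairs` and `cbow_pairs` are hypothetical, and the neural network that is trained on the examples is not shown.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: one (target, context_word) pair per context position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW: one (context_words, target) example per token."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = "a word is characterized by the company it keeps".split()
sg = skipgram_pairs(sentence, window=2)
cb = cbow_pairs(sentence, window=2)
print(sg[:3])
print(cb[2])
```

Skip-Gram produces many small examples per sentence, CBOW one example per token with the whole window as input, which is the symmetry referred to above.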

[Figures; source: https://www.slideshare.net/xavigiro/word-embeddings-d2l4-deep-learning-for-speech-and-language-upc-2017 (thanks to Antonio Bonafonte)]

What are they useful for?

• NLP (word clustering, document classification, tokenization, tagging, parsing, paraphrase detection, NERC, summarization, etc.),

• Speech processing (ASR, TTS),

• Intelligent Information Retrieval,

• Neural Machine Translation,

• Dialog systems,

• Stylometry and stylistic analysis, and many other DH areas.


What to represent?

• WEs are generated from large raw corpora, so that each word occurrence may be associated with a vector.

• One problem: words that do not appear in the corpus will not have a vector generated;

– a solution: compute vector representations for character n-grams. Words are then represented as the sum of the vectors of the character n-grams they contain (Bojanowski et al., 2016).
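A minimal sketch of the character n-gram idea follows. The boundary markers `<` and `>` and the n-gram lengths 3 to 6 follow Bojanowski et al.; the n-gram vectors here are random stand-ins (in a trained model they are learned), and the helper names are hypothetical.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# Random stand-ins for learned n-gram vectors (assumption for illustration).
rng = np.random.default_rng(0)
dim = 8
ngram_vectors = {}

def ngram_vector(gram):
    """Look up an n-gram vector, creating a random one on first use."""
    if gram not in ngram_vectors:
        ngram_vectors[gram] = rng.normal(size=dim)
    return ngram_vectors[gram]

def word_vector(word):
    """Represent a word as the sum of its character n-gram vectors."""
    return np.sum([ngram_vector(g) for g in char_ngrams(word)], axis=0)

print(char_ngrams("where", 3, 3))
shared = set(char_ngrams("fizician")) & set(char_ngrams("fizicianul"))
print(len(shared), word_vector("fizicianul").shape)
```

Because an unseen word form shares most of its n-grams with related forms that were seen, it still receives a vector close to theirs.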


Visualization of Word Embeddings

• Can lead to visual insights into data (corpus) content: related words tend to cluster in groups

• WEs are defined in high-dimensional spaces, so their visualization requires dimensionality reduction (to 2D or 3D), using methods and tools such as PCA, t-SNE, the TensorFlow Embedding Projector, LargeVis, WordCloud, etc.

• If you are interested only in similarity relations, graph rendering is fine!
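As an illustration of the dimensionality-reduction step, here is a small PCA sketch implemented with NumPy's SVD. The embedding matrix is random (a stand-in for real 300-dimensional vectors), and `pca_2d` is a hypothetical helper name; t-SNE or the other tools listed above would be used in the same place.

```python
import numpy as np

def pca_2d(matrix):
    """Project row vectors onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for 50 word vectors of dimension 300.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 300))
points = pca_2d(embeddings)
print(points.shape)
```

Each row of `points` can then be plotted as a labelled point in a scatter plot.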


Evaluating Word Embeddings

• Tricky job: the evaluation is targeted on a specific task: a set of WEs may be very good in one application (e.g. classification) and less satisfactory in another application (e.g. summarization).

• Quality of WEs depends on the corpus size and several hyperparameters (e.g. context size, vector space dimensionality, frequency threshold);

• For evaluations, ground-truth is needed.


Experiments based on the CoRoLa corpus

• CoRoLa = the largest IPR-cleared reference corpus of contemporary written and spoken Romanian (almost 1 billion tokens), fully processed (Barbu Mititelu et al., 2018).

• Accessible via the text interfaces KorAP (Diewald et al., 2016) and NLPCQP (Ion, 2018) as well as a speech interface OCQP (Păiș, 2018) at corola.racai.ro


CoRoLa-based experiments (cont'd)

• The CoRoLa corpus contains various linguistic segmentations and annotations (phoneme, syllable, lemma, part of speech (POS) tagging, syntactic chunking, dependency parsing).

• The WEs and their interfaces are due to Vasile Păiș


CoRoLa-based experiments (cont'd)

• We generated different word-vector sets, with the following parameters:

– Context window: 5 words

– Vector sizes: 100, 200, 300, 400, 500, 600

– Frequency thresholds: 1, 5, 10, 20, 50

– Type of lexical item: word occurrences, lemmas, lemmas+POS

• Application that generated the WEs: fastText

• The wordform WEs are freely downloadable at http://89.38.230.23/word_embeddings/
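The parameter grid above can be enumerated as follows. This is only a sketch: it assumes every combination was trained, the input and output file names are hypothetical, and the choice of the `skipgram` subcommand is an assumption (the slides do not say whether skipgram or cbow was used). The options `-dim`, `-ws` and `-minCount` are standard fastText command-line options.

```python
from itertools import product

# Parameters as listed on the slide.
context_window = 5
vector_sizes = [100, 200, 300, 400, 500, 600]
frequency_thresholds = [1, 5, 10, 20, 50]
item_types = ["wordform", "lemma", "lemma_pos"]

def fasttext_command(item_type, dim, min_count):
    """Build one fastText training command.

    File names and the 'skipgram' subcommand are assumptions made
    for illustration; they are not given in the slides.
    """
    return (
        f"fasttext skipgram -input corola.{item_type}.txt "
        f"-output corola.{item_type}.{dim}.{min_count} "
        f"-dim {dim} -ws {context_window} -minCount {min_count}"
    )

commands = [
    fasttext_command(t, d, f)
    for t, d, f in product(item_types, vector_sizes, frequency_thresholds)
]
print(len(commands))
print(commands[0])
```

A label such as 300/20 in the result tables then corresponds to `-dim 300 -minCount 20`.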


Evaluating Word Embeddings on similarity tests

• Several test sets: the WordSimilarity-353 dataset (also known as Finkelstein-353), manually translated into Romanian by Hassan & Mihalcea, and Păiș’s test sets on countries and their capitals (SET1: 1892 questions and their answers; SET2: 462 questions and their answers)

• We compared the performance of WEs based on Wikipedia (Bojanowski et al., 2016) and on CoRoLa (Păiș & Tufiș, 2018)

• Recently, additional WEs and tests were produced for lemmas and lemmas+POS; not surprisingly, the WEs are slightly different
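For a similarity test such as WordSimilarity-353, the usual score is the Spearman rank correlation between the human ratings and the similarities computed from the vectors. The sketch below implements it with NumPy; the ratings and model similarities are invented toy numbers, not values from the dataset or from the CoRoLa models.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Toy data: human ratings and model cosine similarities for four word pairs.
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.4, 0.6, 0.1]
rho = spearman(human, model)
print(round(rho, 2))
```

A correlation column like the one reported for WS-353 would be obtained this way, presumably multiplied by 100.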


Best results (wordforms)

Model      Dim/freq  SET1  SET2  Corr. on WS-353
Wikipedia  300/20    26%   63%   54
CoRoLa     300/20    35%   74%   52


Best results (lowercased lemma)

Model   Dim/freq  SET1  SET2  Corr. on WS-353
CoRoLa  300/20    35%   72%   n.a.

Best results (lemma)

Model   Dim/freq  SET1  SET2  Corr. on WS-353
CoRoLa  400/50    44%   79%   n.a.


Best results (POS+lemma)

Model   Dim/freq  SET1  SET2  Corr. on WS-353
CoRoLa  300/50    38%   70%   n.a.


Nearest neighbor examples obtained with the CoRoLa model (wordforms)

Word  spania      ilie      euro      sibiu       fizician
1     portugalia  dumitru   usd       brașov      biofizician
2     franța      stoica    dolari    cluj        astrofizician
3     italia      gheorghe  milioane  arad        fizicianul
4     grecia      valeriu   miliarde  sighișoara  matematician
5     olanda      florea    forinți   oradea      geofizician

Nearest neighbor examples obtained with the CoRoLa model (lemma+MSD)

Word  Np_Spania      Np_Ilie        Nc_euro   Np_Sibiu       Nc_fizician
1     Np_Portugalia  Np_Dumitru     Nc_dolar  Np_Sighișoara  Nc_astrofizician
2     Np_Franța      Np_Florea      Nc_leu    Np_Brașov      Nc_matematician
3     Np_Italia      Np_Gheorghe    Nc_mld    Np_Arad        Nc_biolog
4     Np_Grecia      Np_Constantin  Nc_mil    Np_Cluj        Nc_chimist
5     Np_Mexic       Np_Vasilica    Af_euro   Np_Bacău       Af_metafizician
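Nearest-neighbour lists like these can be computed by ranking all words by cosine similarity to the query word. The sketch below assumes the vectors come in the plain-text `.vec` layout written by fastText (a header line with word count and dimension, then one word and its values per line); whether the downloadable CoRoLa files use exactly this layout is an assumption. The four toy vectors are invented, and `load_vec` and `nearest` are hypothetical helper names.

```python
import io
import numpy as np

def load_vec(handle):
    """Read word vectors from a text stream in fastText .vec layout."""
    count, dim = map(int, handle.readline().split())
    words, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:dim + 1]])
    return words, np.array(rows)

def nearest(word, words, matrix, k=5):
    """Return the k words most similar to `word` by cosine similarity."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] != word][:k]

# Toy data standing in for a downloaded vector file.
toy = io.StringIO(
    "4 3\n"
    "spania 1.0 0.1 0.0\n"
    "portugalia 0.9 0.2 0.0\n"
    "euro 0.0 1.0 0.1\n"
    "usd 0.1 0.9 0.2\n"
)
words, matrix = load_vec(toy)
print(nearest("spania", words, matrix, k=2))
print(nearest("euro", words, matrix, k=1))
```

For a real vocabulary the matrix would be normalized once and reused across queries instead of being recomputed on every call.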


Testing and visualizing CoRoLa WEs


Wordform-based WE

• http://89.38.230.23/word_embeddings/ - download vectors and analogy game

• http://89.38.230.23/word_embeddings/view/ - t-SNE most frequent word-forms

• http://89.38.230.23/word_embeddings/view/similar.html - t-SNE most similar to a specific word-form

• http://89.38.230.23/word_embeddings/view/graph.html - graph of most similar to a specific word-form


Lemma-based WE

• http://89.38.230.23/word_embeddings_lemma/ (analogy game)

• http://89.38.230.23/word_embeddings_lemma/view/ (t-SNE most frequent k)

• http://89.38.230.23/word_embeddings_lemma/view/similar.html (t-SNE similar to w)

• http://89.38.230.23/word_embeddings_lemma/view/graph.html (graph of similar to w)


Lemma+MSD-based WE

• http://89.38.230.23/word_embeddings_lemma_msd/ (analogy game)

• http://89.38.230.23/word_embeddings_lemma_msd/view/ (t-SNE most frequent k)

• http://89.38.230.23/word_embeddings_lemma_msd/view/similar.html (t-SNE similar to w)

• http://89.38.230.23/word_embeddings_lemma_msd/view/graph.html (graph of similar to w)


References

1. V. Păiș, D. Tufiș, “Computing Distributed Representations of Words Using the COROLA Corpus”, Proc. Ro. Acad., Series A, Volume 19, No. 2, pp. 403-410, Bucharest, 2018

2. J.R. Firth, “Papers in Linguistics 1934–1951”, (1957) London: Oxford University Press.

3. T. Mikolov, K. Chen, G. Corrado, J. Dean, “Efficient Estimation of Word Representations in Vector Space”, arXiv:1301.3781, 2013

4. Y. Bengio, R. Ducharme, P. Vincent, “A neural probabilistic language model”, Journal of Machine Learning Research, 3:1137-1155, 2003

5. P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, “Enriching Word Vectors with Subword Information”, arXiv:1607.04606, 2016

6. V. Barbu Mititelu, D. Tufiș, E. Irimia, The Reference Corpus of Contemporary Romanian Language (CoRoLa), in Proceedings of the 11th Language Resources and Evaluation Conference – LREC’18, Miyazaki, Japan, European Language Resources Association (ELRA), 2018

7. Bański, P., Diewald, N., Hanl, M., Kupietz, M., Witt, A. Access Control by Query Rewriting. The Case of KorAP. In Proceedings of the Ninth Conference on International Language Resources and Evaluation (LREC’14). Reykjavik, European Language Resources Association (ELRA), 2014

8. Diewald, N., Hanl, M., Margaretha, E., Bingel, J., Kupietz, M., Bański, P. and A. Witt KorAP Architecture – Diving in the Deep Sea of Corpus Data. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz, European Language Resources Association (ELRA), 2016.

9. S. Hassan, R. Mihalcea, “Cross-lingual semantic relatedness using encyclopedic knowledge”, In Proc. EMNLP, 2009

10. L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, “Placing search in context: the concept revisited”, In WWW, pages 406–414, 2001

11. C. Spearman, “The proof and measurement of association between two things”, The American Journal of Psychology, 15(1):72–101, 1904

12. https://www.youtube.com/watch?v=D-ekE-Wlcds


Good news!

• On 11 September 2018, the European Parliament voted in favor of the resolution on “Language equality in the digital age”.

• The EP calls "on the Commission and the Member States to develop strategies and policy action to facilitate multilingualism in the digital market; requests, in this context, that the Commission and the Member States define the minimum language resources that all European languages should possess, such as data sets, lexicons, speech records, translation memories, annotated corpora and encyclopaedic content, in order to prevent digital extinction".

• http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+REPORT+A8-2018-0228+0+DOC+XML+V0//EN&language=en
