
Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage Knowledge Available in...

Date post: 15-Apr-2017
Upload: aist
Transcript
Page 1: Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage Knowledge Available in One Language in Order to Solve a Practical Text Classification Problem In Another

Financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.

APPLYING WORD EMBEDDINGS TO LEVERAGE KNOWLEDGE AVAILABLE IN ONE LANGUAGE TO SOLVE A TEXT CLASSIFICATION PROBLEM IN ANOTHER LANGUAGE

Andrew Smirnov and Valentin Mendelev

[email protected]

AIST 2016

CONTENTS

The problem
Word embeddings
Knowledge transfer
Results
New results
Conclusions

CALL STEERING IN DIFFERENT LANGUAGES

Little training data is available in the target language

Up to 40 classes

The classifier has to be built rapidly

Our goal is to build a classifier from only the class titles and 1-5 artificially generated examples per class

THE PROBLEM

Example request: «У меня вот просто не технический вопрос, а просто можно ли во время отпуска отключить вот этот пакет» ("My question is not a technical one, I simply want to suspend this package while I'm on vacation")

Class: Приостановка услуг (Service suspension)

TRAINING DATA

6000 user requests in Russian

250 manual translations from Russian into Kazakh, divided into development and test sets

WORD EMBEDDINGS

The CBOW architecture predicts the current word based on the context, and the Skip-gram architecture predicts surrounding words given the current word.

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
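The difference between the two architectures is easiest to see in the training examples each one consumes. The following is a minimal sketch, not the authors' code: the function names, the window size and the toy sentence are illustrative assumptions.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, current word) pairs.

    CBOW is trained to predict the current word from its context.
    """
    examples = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, word))
    return examples


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (current word, surrounding word) pairs.

    Skip-gram is trained to predict each surrounding word from the current word.
    """
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + 1 + window)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


# Hypothetical toy sentence; a real corpus would contain millions of tokens.
tokens = "i want to suspend this package".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

With a window of 1, CBOW produces one example per token (context on the left, target on the right), while Skip-gram produces one example per (word, neighbour) pair.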

DETAILS

CBOW

Training set for Russian: ~200m tokens

Conversations, books, news articles

Training set for Kazakh: ~30m tokens

Kazakh Wikipedia and news articles

Vector representation dimension is 200 for Russian and 100 for Kazakh

KNOWLEDGE TRANSFER

Possible categories

сервисы / services
баланс / balance
интернет / internet
неисправность интернет / internet failure
…

Transfer destination

Target -> Source -> Classify

Source -> Target -> Build Classifier -> Classify

Transfer mechanism

Manual translation
Automatic translation
Semantic vector space

KNOWLEDGE TRANSFER

VECTOR SPACE TRANSFORMATION APPROACH*

Translate a set of words (5000 most frequent ones from the training corpus)

Train a linear model by minimizing L2 distance

Apply the transformation and build a kNN classifier

*Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).
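The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear map is fitted in closed form with ordinary least squares (the cited paper uses stochastic gradient descent on the same L2 objective), Kazakh vectors are mapped into the Russian space, kNN uses cosine similarity, and all vectors are small synthetic stand-ins for real embeddings. The names `fit_linear_map`, `knn_predict` and `true_map` are hypothetical.

```python
from collections import Counter

import numpy as np


def fit_linear_map(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Return W minimizing sum_i ||src_vecs[i] @ W - tgt_vecs[i]||^2."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W


def knn_predict(query: np.ndarray, train_vecs: np.ndarray, train_labels, k: int = 3):
    """Majority vote among the k training vectors most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    nearest = np.argsort(-(t @ q))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]


rng = np.random.default_rng(0)

# Synthetic stand-ins: 4-dim "Kazakh" vectors and 6-dim "Russian" vectors
# (the slides use 100 and 200 dimensions). true_map is an assumed ground truth
# used only to generate consistent toy data.
true_map = np.hstack([np.eye(4), 0.5 * np.ones((4, 2))])
kk_dict = rng.normal(size=(50, 4))   # vectors of Kazakh dictionary words
ru_dict = kk_dict @ true_map         # vectors of their Russian translations

# Step 2: train the linear model on the translated word pairs.
W = fit_linear_map(kk_dict, ru_dict)

# Step 3: labeled Russian-space vectors for two classes, then classify a
# Kazakh query after mapping it into the Russian space.
centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
ru_train = np.vstack([(c + 0.05 * rng.normal(size=(3, 4))) @ true_map for c in centers])
ru_labels = ["balance"] * 3 + ["internet"] * 3
kk_query = np.array([1.0, 0.1, 0.0, 0.0])
predicted = knn_predict(kk_query @ W, ru_train, ru_labels, k=3)
```

How a whole request is turned into a single vector is not stated on the slide; the sketch sidesteps that by classifying one vector per item.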

RESULTS

Leave-one-out cross-validation results for the kNN classifier on Kazakh
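For reference, leave-one-out cross-validation of a kNN classifier can be sketched as below: each item is held out in turn and classified by its nearest neighbours among the remaining items. The function name, the cosine similarity measure and the toy 2-D vectors are assumptions made for illustration; the slide does not specify them.

```python
from collections import Counter

import numpy as np


def loo_knn_accuracy(vectors: np.ndarray, labels, k: int = 1) -> float:
    """Leave-one-out accuracy of a cosine-similarity kNN classifier."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # the held-out item may not vote for itself
    correct = 0
    for i, label in enumerate(labels):
        nearest = np.argsort(-sims[i])[:k]
        vote = Counter(labels[j] for j in nearest).most_common(1)[0][0]
        correct += int(vote == label)
    return correct / len(labels)


# Toy data: unit vectors at the given angles. The last "balance" item lies
# closer to the "internet" cluster, so it is misclassified when held out.
angles = np.deg2rad([0, 5, 10, 80, 85, 90, 60])
vectors = np.column_stack([np.cos(angles), np.sin(angles)])
labels = ["balance"] * 3 + ["internet"] * 3 + ["balance"]
accuracy = loo_knn_accuracy(vectors, labels, k=1)
```

Leave-one-out is a natural choice when, as here, only a few hundred labeled examples exist in the target language, since every example is used for both training and testing.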

NEW RESULTS

Classification accuracy for kNN and CNN (not leave-one-out)

Classification accuracy for 10 classes

CONCLUSIONS

Knowledge transfer makes it possible to achieve reasonable classification accuracy on low-resource tasks

CNN and translation strategies produce better results

We want to do better

THANK YOU

CONTACTS

Russia
4 Krasutskogo street, St. Petersburg, 196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: [email protected]

USA
Suite 316, 369 Lexington Ave, New York, NY, 10017
Tel.: +1 646 237 7895
Email: [email protected]

ABOUT THE COMPANY

STC-Innovations is a leader in the multimodal biometric market. STC-Innovations develops multimodal biometric solutions based on person-identifying technologies via voice, face and other noncontact biometric features.

STC-Innovations is a spin-off company of the Speech Technologies Center, a leading global provider of innovative systems in high-quality recording, audio and video processing and analysis, speech synthesis and recognition, and real-time, high-accuracy voice and facial biometrics solutions, with over 20 years of research, development and implementation experience in Russia and internationally. STC is ISO 9001:2008 certified.

CLIENTS & PARTNERS

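Federal Penitentiary Service of Russia (ФСИН России)
Ministry of Defence of Russia (Минобороны России)
Federal Security Service of Russia (ФСБ России)
Ministry of Internal Affairs of Russia (МВД России)
Ministry of Emergency Situations of Russia (МЧС России)
Ministry of Telecom and Mass Communications of Russia (Минкомсвязь России)
Ministry of the Interior of Ecuador (МВД Эквадора)
Ministry of the Interior of Mexico (МВД Мексики)

COMMUNICATION

FINANCE & INSURANCE

TRANSPORT

MINING & ENERGY

GOVERNMENT

SPORTS & ENTERTAINMENT

MEDICINE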

