Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage Knowledge Available in...

Post on 15-Apr-2017


Financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.


APPLYING WORD EMBEDDINGS TO LEVERAGE KNOWLEDGE AVAILABLE IN ONE LANGUAGE TO SOLVE A TEXT CLASSIFICATION PROBLEM IN ANOTHER LANGUAGE

Andrew Smirnov and Valentin Mendelev

smirnov-a@speechpro.com

AIST 2016


CONTENTS

The problem

Word embeddings

Knowledge transfer

Results

New results

Conclusions


CALL STEERING IN DIFFERENT LANGUAGES

Little training data is available in the target language

Up to 40 classes

The classifier has to be built rapidly

Our goal is to be able to build a classifier having only class titles and 1-5 artificially generated examples for each class

THE PROBLEM

Example request: «У меня вот просто не технический вопрос, а просто можно ли во время отпуска отключить вот этот пакет» ("My question is not a technical one, I simply want to suspend this package while I'm on vacation")

Class: Приостановка услуг ("Service suspension")

TRAINING DATA

6,000 user requests in Russian

250 manual translations from Russian into Kazakh, divided into development and test sets


WORD EMBEDDINGS

The CBOW architecture predicts the current word based on the context, and the Skip-gram architecture predicts surrounding words given the current word

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)
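The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below is illustrative only: the function names, the window size and the toy sentence are assumptions, and it shows how examples are formed, not how the network is trained.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, current_word): CBOW predicts the current word from its context."""
    for i, current in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, current


def skipgram_examples(tokens, window=2):
    """Yield (current_word, surrounding_word): Skip-gram predicts each neighbour from the current word."""
    for context, current in cbow_examples(tokens, window):
        for neighbour in context:
            yield current, neighbour


# Toy sentence (hypothetical); a real corpus would have millions of tokens.
tokens = "i want to suspend this package".split()
cbow = list(cbow_examples(tokens, window=1))
skipgram = list(skipgram_examples(tokens, window=1))

print(cbow[1])        # context words and the word to predict
print(skipgram[1:3])  # one (current, neighbour) pair per context word
```

With CBOW each position in the text gives one example with several input words; with Skip-gram the same position gives one example per neighbouring word.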

DETAILS

CBOW

Training set for Russian: ~200M tokens (conversations, books, news articles)

Training set for Kazakh: ~30M tokens (Kazakh Wikipedia and news articles)

Vector representation dimension is 200 for Russian and 100 for Kazakh


KNOWLEDGE TRANSFER

Possible categories

сервисы / services

баланс / balance

интернет / internet

неисправность интернет / internet failure

….

Transfer destination

Target -> Source -> Classify

Source -> Target -> Build Classifier -> Classify

Transfer mechanism

Manual translation

Automatic translation

Semantic vector space


KNOWLEDGE TRANSFER

VECTOR SPACE TRANSFORMATION APPROACH*

Translate a set of words (the 5,000 most frequent ones in the training corpus)

Train a linear model by minimizing the L2 distance between the transformed vectors and the vectors of their translations

Apply the transformation and build a kNN classifier

*Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).
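The linear-model step above can be sketched as fitting a matrix W so that W x_i is close to z_i for every translated word pair (x_i, z_i). This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say which optimiser was used or in which direction the map runs, so plain batch gradient descent, the function names and the tiny 3-dimensional to 2-dimensional toy vectors are all hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_linear_map(src_vecs, tgt_vecs, lr=0.1, epochs=500):
    """Fit W minimising sum_i ||W x_i - z_i||^2 with batch gradient descent.

    lr and epochs are tuned for the toy data below (assumption), not for real embeddings.
    """
    d_in, d_out = len(src_vecs[0]), len(tgt_vecs[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, z in zip(src_vecs, tgt_vecs):
            err = [p - t for p, t in zip(matvec(W, x), z)]
            for r in range(d_out):
                for c in range(d_in):
                    grad[r][c] += 2.0 * err[r] * x[c]  # d/dW of the squared error
        for r in range(d_out):
            for c in range(d_in):
                W[r][c] -= lr * grad[r][c]
    return W


# Hypothetical "dictionary": vectors of words in one language and of their translations.
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
tgt_vecs = [[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

W = fit_linear_map(src_vecs, tgt_vecs)

# Map a vector that was not in the dictionary into the other language's space.
mapped = matvec(W, [0.0, 1.0, 1.0])
print([round(v, 3) for v in mapped])
```

The two spaces may have different dimensions (200 for Russian and 100 for Kazakh in this work), which is why W is rectangular in the toy example as well. Once vectors from both languages live in one space, a kNN classifier can compare them directly.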


RESULTS

Leave-one-out cross-validation results for the kNN classifier on the Kazakh language
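For readers unfamiliar with the evaluation protocol, leave-one-out cross-validation for a kNN classifier can be sketched as follows: each example is held out in turn and classified using all the others. The cosine similarity measure, the function names and the 2-dimensional toy vectors are assumptions made for illustration; the slides do not specify how requests were turned into vectors or which distance was used.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, examples, k=1):
    """Return the majority label among the k examples most similar to query."""
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(examples, k=1):
    """Hold out each example in turn and classify it with the remaining ones."""
    correct = 0
    for i, (vec, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if knn_predict(vec, rest, k) == label:
            correct += 1
    return correct / len(examples)


# Hypothetical request vectors with class labels.
examples = [
    ([1.0, 0.1], "balance"),
    ([0.9, 0.2], "balance"),
    ([0.1, 1.0], "internet"),
    ([0.2, 0.9], "internet"),
    ([1.0, 1.0], "services"),  # only example of its class: always misclassified when held out
]

accuracy = leave_one_out_accuracy(examples, k=1)
print(accuracy)
```

The single-example class shows why very small per-class sample counts are hard for this protocol: when its only example is held out, no neighbour with the right label remains.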


NEW RESULTS

Classification accuracy for kNN and CNN (not leave-one-out)

Classification accuracy for 10 classes


CONCLUSIONS

Knowledge transfer makes it possible to achieve reasonable classification accuracy on low-resource tasks

CNN and translation strategies produce better results

We want to do better


THANK YOU

CONTACTS

Russia

4 Krasutskogo street, St. Petersburg, 196084

Tel.: +7 812 325-8848

Fax: +7 812 327 9297

Email: info@speechpro.com

USA

Suite 316, 369 Lexington Ave, New York, NY, 10017

Tel.: +1 646 237 7895

Email: sales-usa@speechpro.com

ABOUT THE COMPANY

STC-Innovations is a leader in the multimodal biometric market. STC-Innovations develops multimodal biometric solutions based on person-identifying technologies via voice, face and other noncontact biometric features.

STC-Innovations is a spin-off company of the Speech Technologies Center (STC), a leading global provider of innovative systems in high-quality recording, audio and video processing and analysis, speech synthesis and recognition, and real-time, high-accuracy voice and facial biometrics solutions, with over 20 years of research, development and implementation experience in Russia and internationally. STC is ISO 9001:2008 certified.


CLIENTS & PARTNERS

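Federal Penitentiary Service of Russia (ФСИН России)

Ministry of Defence of Russia (Минобороны России)

Federal Security Service of Russia (ФСБ России)

Ministry of Internal Affairs of Russia (МВД России)

Ministry of Emergency Situations of Russia (МЧС России)

Ministry of Telecom and Mass Communications of Russia (Минкомсвязь России)

Ministry of Internal Affairs of Ecuador (МВД Эквадора)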

COMMUNICATION

FINANCE & INSURANCE

TRANSPORT

MINING & ENERGY

GOVERNMENT

SPORTS & ENTERTAINMENT

MEDICINE

Ministry of Internal Affairs of Mexico (МВД Мексики)