Date posted: 15-Apr-2017
Financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.
APPLYING WORD EMBEDDINGS TO LEVERAGE KNOWLEDGE AVAILABLE IN ONE LANGUAGE TO SOLVE A TEXT CLASSIFICATION PROBLEM IN ANOTHER LANGUAGE
Andrew Smirnov and Valentin [email protected]
AIST 2016
CONTENTS
The problem
Word embeddings
Knowledge transfer
Results
New results
Conclusions
CALL STEERING IN DIFFERENT LANGUAGES
Little training data is available in the target language
Up to 40 classes
The classifier has to be built rapidly
Our goal is to build a classifier from only the class titles and 1-5 artificially generated examples for each class
THE PROBLEM
Example request: «У меня вот просто не технический вопрос, а просто можно ли во время отпуска отключить вот этот пакет» ("My question is not a technical one, I simply want to suspend this package while I'm on vacation")
Class: Приостановка услуг (Service suspension)
TRAINING DATA
6000 user requests in Russian
250 manual translations from Russian into Kazakh, divided into development and test sets
WORD EMBEDDINGS
The CBOW architecture predicts the current word based on the context, and the Skip-gram architecture predicts the surrounding words given the current word.
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
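The difference between the two architectures can be illustrated by the training pairs each one extracts from a sentence. The sketch below is only an illustration, not the training code used for the presentation: the function names, the window size, and the example sentence are assumptions made here.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW view: (context words, target word). The model predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i, excluding the word itself.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram view: (current word, one surrounding word). The model predicts the neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical example sentence, not taken from the training corpora.
tokens = "i want to suspend this package".split()
print(cbow_pairs(tokens, window=1)[1])        # (['i', 'to'], 'want')
print(skipgram_pairs(tokens, window=1)[:3])   # [('i', 'want'), ('want', 'i'), ('want', 'to')]
```

CBOW produces one training example per position, with the whole context as input, while Skip-gram produces one example per (word, neighbour) pair.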
DETAILS
Architecture: CBOW
Training set for Russian: ~200M tokens (conversations, books, news articles)
Training set for Kazakh: ~30M tokens (Kazakh Wikipedia and news articles)
Vector representation dimension is 200 for Russian and 100 for Kazakh
KNOWLEDGE TRANSFER
Possible categories
сервисы / services
баланс / balance
интернет / internet
неисправность интернет / internet failure
…
Transfer destination
Target -> Source -> Classify
Source -> Target -> Build Classifier -> Classify
Transfer mechanism
Manual translation
Automatic translation
Semantic vector space
KNOWLEDGE TRANSFER
VECTOR SPACE TRANSFORMATION APPROACH*
Translate a set of words (the 5000 most frequent ones from the training corpus)
Train a linear mapping between the two embedding spaces by minimizing the L2 distance between the mapped vectors and the vectors of the translations
Apply the transformation and build a kNN classifier
*Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).
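The three steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the data is synthetic, the mapping direction (100-dimensional Kazakh vectors into the 200-dimensional Russian space) is one of the two directions listed earlier and is chosen here only for illustration, and the slides do not say how an utterance is turned into a single vector, so the labelled vectors below simply stand in for whatever representation is used. The helper names `fit_linear_map` and `knn_predict` are hypothetical.

```python
import numpy as np


def fit_linear_map(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Least-squares W minimizing ||src @ W - tgt||^2 over the dictionary pairs."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def knn_predict(train_x: np.ndarray, train_y: np.ndarray, query: np.ndarray, k: int = 3):
    """Majority vote among the k training vectors most cosine-similar to the query."""
    unit_x = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    unit_q = query / np.linalg.norm(query)
    nearest = np.argsort(-(unit_x @ unit_q))[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]


rng = np.random.default_rng(0)
kz_dim, ru_dim, n_pairs = 100, 200, 500  # dims from the slides; 500 pairs keeps the toy small

# Synthetic stand-in for the bilingual dictionary: Kazakh word vectors and the
# vectors of their Russian translations, related by an unknown linear map plus noise.
true_map = rng.normal(size=(kz_dim, ru_dim))
kz_words = rng.normal(size=(n_pairs, kz_dim))
ru_words = kz_words @ true_map + 0.01 * rng.normal(size=(n_pairs, ru_dim))

W = fit_linear_map(kz_words, ru_words)

# Synthetic labelled Russian-space vectors for two of the classes named in the slides.
centers = rng.normal(size=(2, kz_dim))
train_x = np.vstack([(c + 0.1 * rng.normal(size=(20, kz_dim))) @ true_map for c in centers])
train_y = np.array(["balance"] * 20 + ["internet"] * 20)

# A Kazakh-space query near the second class: map it into the Russian space, then classify.
kz_query = centers[1] + 0.1 * rng.normal(size=kz_dim)
predicted = knn_predict(train_x, train_y, kz_query @ W, k=3)
print(predicted)
```

Because the mapping is linear and fitted in closed form, it needs only the word list and the two pretrained embedding models, which fits the requirement that the classifier be built rapidly.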
RESULTS
Leave-one-out cross-validation results for the kNN classifier on Kazakh
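For reference, leave-one-out evaluation of a kNN classifier can be sketched as below: each example is classified using all the remaining examples as training data, and accuracy is the fraction classified correctly. The cosine similarity, the value of k, and the synthetic two-class data are assumptions made for this sketch; it does not reproduce the reported results.

```python
import numpy as np


def loo_knn_accuracy(x: np.ndarray, y: np.ndarray, k: int = 1) -> float:
    """Leave-one-out accuracy of a cosine-similarity kNN classifier."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # the held-out example must not vote for itself
    correct = 0
    for i in range(len(x)):
        nearest = np.argsort(-sims[i])[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        correct += int(labels[np.argmax(counts)] == y[i])
    return correct / len(x)


# Synthetic data: two well-separated clusters of 10 vectors each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 50))
x = np.vstack([c + 0.1 * rng.normal(size=(10, 50)) for c in centers])
y = np.repeat([0, 1], 10)

accuracy = loo_knn_accuracy(x, y, k=3)
print(accuracy)
```

Leave-one-out is a natural choice when only a few hundred labelled examples exist, since every example serves as a test case exactly once.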
NEW RESULTS
Classification accuracy for kNN and CNN (not leave-one-out)
Classification accuracy for 10 classes
CONCLUSIONS
Knowledge transfer makes it possible to achieve reasonable classification accuracy on low-resource tasks
CNN and translation strategies produce better results
We want to do better
THANK YOU
CONTACTS
Russia: 4 Krasutskogo street, St. Petersburg, 196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: [email protected]
USA: Suite 316, 369 Lexington Ave, New York, NY, 10017
Tel.: +1 646 237 7895
Email: [email protected]
ABOUT THE COMPANY
STC-Innovations is a leader in the multimodal biometric market. STC-Innovations develops multimodal biometric solutions based on person-identifying technologies via voice, face and other noncontact biometric features.
STC-Innovations is a spin-off company of the Speech Technologies Center (STC), a leading global provider of innovative systems for high-quality recording, audio and video processing and analysis, speech synthesis and recognition, and real-time, high-accuracy voice and facial biometrics, with over 20 years of research, development and implementation experience in Russia and internationally. STC is ISO 9001:2008 certified.
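CLIENTS & PARTNERS
Government clients: Federal Penitentiary Service of Russia (FSIN), Ministry of Defence of Russia, Federal Security Service of Russia (FSB), Ministry of Internal Affairs of Russia (MVD), Ministry of Emergency Situations of Russia (MChS), Ministry of Telecom and Mass Communications of Russia, Ministry of the Interior of Ecuador, Ministry of the Interior of Mexico
Sectors: Communication, Finance & Insurance, Transport, Mining & Energy, Government, Sports & Entertainment, Medicine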