AI4Trust
ARTIFICIAL INTELLIGENCE FOR BUILDING TRUST
MASTER’S DEGREE IN ARTIFICIAL INTELLIGENCE, PATTERN RECOGNITION AND DIGITAL IMAGING
UNIVERSITAT POLITÈCNICA DE VALÈNCIA
March 2, 2017
Francisco Manuel Rangel Pardo
AI4Trust - MIARFID ALC 2017
autoritas
8 changes that will transform humanity forever
1. Rewrite genetics
2. Discover new materials
3. Govern a disenchanted world
4. Teach machines to learn
5. Find a universal energy source
6. Defeat infections
7. Grow crops with little water
8. Life on Mars
To build trust
TRUST
- STRATEGY
- INTELLIGENCE
- DASHBOARD
- ACTION
- METHODS/TOOLS
- TRAINING
Intelligence
We need to answer questions...
… to know what questions to ask
Easy!
Big data + artificial intelligence
In real time!!
Only about 2% of content is geotagged!!
Language variety identification to improve geotagging ----> later on
Is the following sentence positive, negative, neutral or none?
"I really want Cataluña to be #independent!!"
It depends on the subjectivity of the receiver, e.g.:
- Positive for Puigdemont
- Negative for Rajoy
But what about the stance of the author with respect to the main topic?
https://www.youtube.com/watch?v=YrqMEn-5Pi8
BIG DATA IS A BIG PROBLEM… AND A BIG OPPORTUNITY
tweets/second, tweets/minute, tweets/hour, tweets/day
- Retrieve and store
- Evolution
- Words and topics
- Labelling
- Hashtags
- People
- Locations
- Brands
- Polarity, stance
- Users, relationships
- Gender, age
- Author profile
- ...
What’s the profile of your organisation’s community?
What’s the affinity between political parties and media?
[Figure: political parties vs. media]
Some applied technologies
- Age & gender identification ----> PAN@CLEF; EmoGraph
- Language variety identification ----> LDR + Word2Vec
- Language variety & gender identification ----> PAN 2017
- Stance & gender detection ----> IberEval 2017
Language Variety Identification
Language variety identification aims to detect linguistic variations in order to classify different varieties of the same language.
Language variety identification may be considered an author profiling task, and not only a classification one, because cultural idiosyncrasies may influence the way users use the language (e.g. different expressions, vocabulary…).
Differences with the state of the art
The goal, as in the state of the art, is to discriminate between different varieties of the same language, but with the following differences:
- We focus on different varieties of Spanish, although we also tested our approach with a different set of languages.
- Instead of n-gram based representations, we propose a low dimensionality representation, which is helpful when dealing with big data in social media.
- We evaluate the proposed method on an independent test set generated from different authors, in order to reduce possible overfitting.
- We make our dataset available to the research community (https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs).
A Low Dimensionality Representation (LDR)
Step 1. Term frequency - inverse document frequency (tf-idf) matrix:
- Each column is a vocabulary term t
- Each row is a document d
- w_ij is the tf-idf weight of the term j in the document i
- δ(d_i) represents the class c assigned to the document i
Step 2. Class-dependent term weighting
Step 3. Class-dependent document representation
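The three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothed idf, the ratio used as the class-dependent weight, and the six per-class statistics (mean, standard deviation, minimum, maximum, plus two ratios called prob and prop here) are assumptions, because the slides only name the steps and state that there are six features per variety. The toy documents and labels are invented.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev

def tfidf(docs):
    """Step 1: tf-idf weight of every term in every tokenised document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed idf (an assumption) so that no weight is exactly zero.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: (c / len(d)) * idf[t] for t, c in Counter(d).items()} for d in docs]

def class_weights(weights, labels):
    """Step 2: share of each term's total tf-idf mass that falls in each class."""
    per_class = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for w, c in zip(weights, labels):
        for t, v in w.items():
            per_class[t][c] += v
            total[t] += v
    classes = sorted(set(labels))
    W = {t: {c: per_class[t][c] / total[t] for c in classes} for t in total}
    return W, classes

def ldr_vector(doc, W, classes):
    """Step 3: six statistics per class over the class weights of the document's terms."""
    vec = []
    for c in classes:
        ws = [W[t][c] for t in doc if t in W]  # terms found in the vocabulary
        if not ws:
            vec += [0.0] * 6
            continue
        vec += [mean(ws), pstdev(ws), min(ws), max(ws),
                sum(ws) / len(doc),   # "prob": total weight over all document terms
                len(ws) / len(doc)]   # "prop": share of document terms in the vocabulary
    return vec

# Invented toy data: two varieties, four training documents.
train = [["vos", "sos", "che"], ["vale", "tio", "vosotros"], ["che", "vos"], ["vale", "vosotros"]]
labels = ["AR", "ES", "AR", "ES"]
W, classes = class_weights(tfidf(train), labels)
vec = ldr_vector(["che", "vos", "vale"], W, classes)
print(len(vec), vec)
```

Under these assumptions the vector length is six times the number of varieties, so five varieties would give 30 features regardless of vocabulary size.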
LDR features
Alternative representations
- We use the common state-of-the-art representations based on n-grams. We iterated n from 1 to 10 and selected the 1,000, 5,000 and 10,000 most frequent n-grams. The best results were obtained with:
  - character 4-grams; the 10,000 most frequent
  - word 1-grams (bag-of-words); the 10,000 most frequent
  - word 2-grams; the 10,000 with the highest tf-idf
- Two variations of the continuous Skip-gram model (Mikolov et al.):
  - Skip-grams
  - Sentence Vectors
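As an illustration of the n-gram baselines listed above, the following sketch selects the k most frequent character n-grams from a corpus and turns a text into a count vector over them. The function names and the toy texts are hypothetical; the slides do not describe the actual feature-extraction code.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(texts, n, k):
    """The k most frequent character n-grams across all texts."""
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    return [g for g, _ in counts.most_common(k)]

def vectorize(text, vocab, n):
    """Count vector of a text over a fixed n-gram vocabulary."""
    counts = Counter(char_ngrams(text, n))
    return [counts[g] for g in vocab]

# Invented toy corpus; the slides use n = 4 and k = 10,000.
texts = ["che vos sos", "vale vosotros", "che vos"]
vocab = top_ngrams(texts, n=4, k=5)
print(vocab, vectorize("che vos", vocab, n=4))
```

The vector length equals k, which is why these representations end up with thousands of features.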
Skip-gram vectors are learned by maximizing the average log probability of the context words, using the negative sampling estimator.
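For reference, the standard forms of these two expressions in Mikolov et al. are given below. The notation is that of their paper and is an assumption with respect to these slides: T is the number of training words, c the size of the context window, v and v' the input and output vectors, σ the logistic function, k the number of negative samples and P_n(w) the noise distribution.

```latex
% Average log probability maximized by the Skip-gram model
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Negative sampling estimate of \log p(w_O \mid w_I)
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```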
HispaBlogs
https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
- Completely independent authors between training and test sets
- Manually collected by social media experts at Autoritas
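Keeping authors disjoint between training and test sets can be sketched as follows. The function name, the (author, text) pair format, the split ratio and the toy data are all assumptions for illustration; HispaBlogs itself was split by its authors, not by this code.

```python
import random

def split_by_author(docs, test_ratio=0.3, seed=0):
    """Split (author, text) pairs so that no author appears in both sets."""
    authors = sorted({a for a, _ in docs})
    random.Random(seed).shuffle(authors)
    n_test = max(1, round(len(authors) * test_ratio))
    test_authors = set(authors[:n_test])
    train = [d for d in docs if d[0] not in test_authors]
    test = [d for d in docs if d[0] in test_authors]
    return train, test

# Invented toy data: four authors with two posts each.
docs = [(f"author{i}", f"post {j}") for i in range(4) for j in range(2)]
train, test = split_by_author(docs)
print(len(train), len(test))
```

Splitting on authors instead of on documents prevents a classifier from scoring well simply by recognising an individual's writing style.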
Accuracy results with different machine learning algorithms
Significance of the results with respect to the two systems with the highest performance
The effect of the pre-processing
Accuracy obtained after removing words with frequency equal to or lower than n
(a) Continuous scale (b) Non-continuous scale
Number of words after removing those with frequency equal to or lower than n, and some examples of very infrequent words.
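The pre-processing step described here, dropping words whose corpus frequency is equal to or lower than n, can be sketched as below. The function name and toy documents are hypothetical, and counting frequency over the whole corpus is an assumption.

```python
from collections import Counter

def remove_infrequent(docs, n):
    """Drop every word whose total corpus frequency is <= n."""
    freq = Counter(t for d in docs for t in d)
    return [[t for t in d if freq[t] > n] for d in docs]

# Invented toy corpus of tokenised documents.
docs = [["hola", "che", "hola"], ["hola", "vale"]]
filtered = remove_infrequent(docs, n=1)
vocab_size = len({t for d in filtered for t in d})
print(filtered, vocab_size)
```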
Evaluation results
Representation     Accuracy (%)
Skip-gram          72.2
LDR                71.1
SenVec             70.8
BOW                52.7
Char 4-grams       51.5
EmoGraph           39.3
tf-idf 2-grams     32.2
Random baseline    20.0
Error analysis
Confusion matrix of the 5-class classification
F1 values for identification as the corresponding language variety vs. others
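A one-vs-rest F1 value per variety can be derived directly from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the 2x2 matrix and its labels are invented for illustration and are not the results shown on the slide.

```python
def one_vs_rest_f1(confusion, labels):
    """F1 of each class against all others; confusion[i][j] counts true i predicted as j."""
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true k, predicted as another class
        fp = sum(row[k] for row in confusion) - tp   # another class, predicted as k
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Invented confusion matrix for two hypothetical classes A and B.
scores = one_vs_rest_f1([[8, 2], [1, 9]], ["A", "B"])
print(scores)
```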
Features contribution
Accuracy obtained with different combinations of features
Cost analysis
Complexity of obtaining the features:
- l: number of varieties
- n: number of terms in the document
- m: number of terms in the document that coincide with some term in the vocabulary
- n ≥ m and l << n
Number of features:
Representation     # Features
LDR                30
Skip-gram          300
SenVec             300
EmoGraph           1,100
BOW                10,000
Char 4-grams       10,000
tf-idf 2-grams     10,000
Robustness
Results obtained with the development set of the DSLCC corpus from the Discriminating between Similar Languages task (2015)
NOTE: Significant results in bold
LDR for age and gender identification
Dataset      Genre         Lang  Age              Pos.  Gender           Pos.  No. of participants
                                 EmoGraph  LDR          EmoGraph  LDR
PAN-AP-2013  Social Media  ES    66.24*    62.70  3     63.65*    60.75  6     21
PAN-AP-2014  Social Media  ES    45.9      38.16  6     68.6*     56.89  9     9
PAN-AP-2014  Social Media  EN    34.2*     31.63  6     53.4      51.42  9     10
PAN-AP-2014  Blogs         ES    46.4      46.43  3     64.3      50.00  5     9
PAN-AP-2014  Blogs         EN    46.2      38.46  3     71.3      67.95  1     10
PAN-AP-2014  Twitter       ES    58.9      56.67  2     73.3      63.33  2     8
PAN-AP-2014  Twitter       EN    45.5      52.60  1     72.1      67.53  3     9
PAN-AP-2014  Reviews       EN    30.8      32.28  5     66.1      67.11  5     10
Conclusions
LDR outperforms common state-of-the-art representations by about 35% in relative accuracy (71.1% vs. 52.7% for bag-of-words, the best of them).
LDR obtains competitive results compared with two distributed representation-based approaches that employ the popular continuous Skip-gram model.
LDR remains competitive with different languages and media (DSLCC).
The dimensionality is reduced from thousands of features to only 6 per language variety, which makes it possible to deal with big data in social media.
We have applied LDR to age and gender identification, with results that are competitive with the well-performing EmoGraph.
HispaTweets
https://github.com/autoritas/RD-Lab/tree/master/doc/projects/Identificacion-de-la-Variedad-del-Lenguaje-para-la-Mejora-del-Geoposicionamiento-en-Social-Media
- 650 authors per language variety
- 865 tweets per author (avg)
- 7 Spanish varieties: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela
PAN-AP 2017
GENDER AND LANGUAGE VARIETY IDENTIFICATION
● ENGLISH: Australia, Canada, Great Britain, Ireland, New Zealand, United States
● SPANISH: Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela
● PORTUGUESE: Brazil, Portugal
● ARABIC: Egypt, Gulf, Levantine, Maghrebi
http://pan.webis.de/clef17/pan17-web/author-profiling.html
- 500 authors per gender and language variety
- 100 tweets per author
IberEval 2017
http://stel.ub.edu/Stance-IberEval2017/index.html
GENDER AND STANCE DETECTION WITH RESPECT TO THE INDEPENDENCE OF CATALONIA
● Tweets in Spanish and Catalan
● Annotated with
  ○ Gender (male / female)
  ○ Stance (favor / against / none)
Language: Catalan
Stance: FAVOR
Gender: FEMALE
Tweet: "15 diplomàtics internacional observen les plebiscitàries, serà que interessen a tothom menys a Espanya #27S" (‘15 international diplomats observe the plebiscite, perhaps it is of interest to everybody except Spain #27S’)