AI4Trust - Artificial Intelligence for Building Trust

AI4Trust

ARTIFICIAL INTELLIGENCE FOR BUILDING TRUST

MASTER’S DEGREE IN ARTIFICIAL INTELLIGENCE, PATTERN RECOGNITION AND DIGITAL IMAGING

UNIVERSITAT POLITÈCNICA DE VALÈNCIA

March 2, 2017

Francisco Manuel Rangel Pardo

AI4Trust - MIARFID ALC 2017

autoritas

8 changes that will transform humanity forever

1. Rewrite genetics
2. Discover new materials
3. Govern a disenchanted world
4. Teach machines to learn
5. Find a universal energy
6. Beat infections
7. Grow crops with little water
8. Life on Mars


To build trust

[Diagram labels: TRUST, STRATEGY, INTELLIGENCE, DASHBOARD, ACTION, METHODS/TOOLS, TRAINING]


Intelligence

We need to answer questions...

… to know what questions to ask


Easy!

Big data + artificial intelligence


In real time!!


Only about 2% of content is geotagged!!

Language variety identification to improve geotagging ----> Later on


Is the following sentence positive, negative, neutral or none?

I really want Cataluña to be #independent!!


It depends on the subjectivity of the receiver, e.g.
- Positive for Puigdemont
- Negative for Rajoy


But what about the stance of the author with respect to the main topic?


- Retrieve and store
- Evolution
- Words and topics
- Labelling
- Hashtags
- People
- Locations
- Brands
- Polarity, stance
- Users, relationships
- Gender, age
- Author profile
- ...

BIG DATA IS A BIG PROBLEM… AND A BIG OPPORTUNITY

tweets/second, tweets/minute, tweets/hour, tweets/day

https://www.youtube.com/watch?v=YrqMEn-5Pi8

What’s the profile of your organisation’s community?


What’s the affinity between political parties and media?

[Figure labels: Political parties, Media]

Some applied technologies

- Age & gender identification ----> PAN@CLEF; EmoGraph
- Language variety identification ----> LDR + Word2Vec
- Language variety & gender identification ----> PAN 2017
- Stance & gender detection ----> IberEval 2017

Language Variety Identification

Language variety identification aims to detect linguistic variations in order to classify different varieties of the same language.

Language variety identification may be considered an author profiling task, besides a classification one, because the cultural idiosyncrasies may influence the way users use the language (e.g. different expressions, vocabulary…).

Differences with the state of the art

As in the state of the art, the goal is to discriminate between different varieties of the same language, but with the following differences:

- We focus on different varieties of Spanish, although we tested our approach also with a different set of languages.

- Instead of n-gram based representations, we propose a low dimensionality representation which is helpful when dealing with big data in social media.

- We evaluate the proposed method with an independent test set generated from different authors in order to reduce possible overfitting.

- We make our dataset available to the research community: https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
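The author-independent evaluation mentioned in the list above can be sketched as follows. This is a minimal illustration, not the procedure used to build the dataset: the function name `split_by_author`, the `(author, text, label)` tuple layout, the toy data and the 30% test fraction are all assumptions made for the example.

```python
import random

def split_by_author(docs, test_fraction=0.3, seed=0):
    """Split (author, text, label) tuples so that no author is in both sets."""
    authors = sorted({author for author, _, _ in docs})
    rng = random.Random(seed)
    rng.shuffle(authors)
    # Hypothetical choice: hold out a fixed fraction of the authors.
    n_test = max(1, int(len(authors) * test_fraction))
    test_authors = set(authors[:n_test])
    train = [d for d in docs if d[0] not in test_authors]
    test = [d for d in docs if d[0] in test_authors]
    return train, test

# Toy data, invented for illustration only.
docs = [
    ("ana", "post 1", "ES"), ("ana", "post 2", "ES"),
    ("luis", "post 3", "MX"), ("eva", "post 4", "AR"),
    ("eva", "post 5", "AR"), ("raul", "post 6", "CL"),
]
train, test = split_by_author(docs)
```

Splitting on authors rather than on documents keeps a classifier from scoring well merely by recognising an individual writer's style.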

A Low Dimensionality Representation (LDR)

Step 1. Term frequency - inverse document frequency (tf-idf) matrix:

- Each column is a vocabulary term t
- Each row is a document d
- w_ij is the tf-idf weight of term j in document i
- The class c assigned to document i is also recorded

Step 2. Class-dependent term weighting:

Step 3. Class-dependent document representation:
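The three steps can be sketched in plain Python. This is a hedged illustration under stated assumptions, not the authors' implementation. The smoothed tf-idf formula is an assumption. So is the form of the class-dependent weight, taken here as the share of a term's tf-idf mass that falls in each class. The six per-class statistics (mean, standard deviation, minimum, maximum, and two ratios) are likewise assumed, since the slides do not spell them out. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev

def tfidf(docs):
    """Step 1: one {term: weight} dict per tokenised document (smoothed idf)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
         for t, c in Counter(doc).items()}
        for doc in docs
    ]

def class_term_weights(weights, labels):
    """Step 2 (assumed form): share of each term's tf-idf mass per class."""
    per_class = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for doc_w, label in zip(weights, labels):
        for t, w in doc_w.items():
            per_class[label][t] += w
            total[t] += w
    return {c: {t: w / total[t] for t, w in terms.items()}
            for c, terms in per_class.items()}

def ldr_vector(doc, W, classes):
    """Step 3 (assumed statistics): six summary values per class."""
    vocab = set().union(*(W[c] for c in classes))
    known = [t for t in doc if t in vocab]
    features = []
    for c in classes:
        vals = [W[c].get(t, 0.0) for t in known]
        if not vals:
            features += [0.0] * 6
            continue
        features += [mean(vals), pstdev(vals), min(vals), max(vals),
                     sum(vals) / len(doc), len(known) / len(doc)]
    return features

# Toy training data, invented for illustration only.
train_docs = [["vos", "sos", "che"], ["vale", "tio", "vosotros"],
              ["che", "vos", "boludo"]]
train_labels = ["AR", "ES", "AR"]
W = class_term_weights(tfidf(train_docs), train_labels)
classes = sorted(W)
vec = ldr_vector(["che", "vos", "hola"], W, classes)
```

With two classes the vector has 12 entries. The length depends only on the number of classes, not on the vocabulary size, which is where the low dimensionality comes from.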

LDR features

Alternative representations

- We use the common state-of-the-art representations based on n-grams. We iterated n from 1 to 10 and selected the 1,000, 5,000 and 10,000 most frequent n-grams. The best results were obtained with:
  - character 4-grams; the 10,000 most frequent
  - word 1-grams (bag-of-words); the 10,000 most frequent
  - word 2-grams; the 10,000 highest tf-idf

- Two variations of the continuous Skip-gram model (Mikolov et al.):
  - Skip-grams
  - Sentence Vectors

Maximizing the average of the log probability:

Using the negative sampling estimator:
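For reference, the standard Skip-gram objective and its negative sampling estimator, as formulated by Mikolov et al., can be written as follows. The notation is that of the original Skip-gram formulation and is assumed here rather than taken from these slides: T is the number of training words, c the context window size, v and v' the input and output vectors, k the number of negative samples, and P_n(w) the noise distribution.

```latex
% Average log probability maximised by the Skip-gram model
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Negative sampling estimator replacing each \log p(w_O \mid w_I) term
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]
```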

HispaBlogs


- Completely independent authors between training and test sets

- Manually collected by social media experts of Autoritas

Accuracy results with different machine learning algorithms

Significance of the results wrt. the two systems with the highest performance

The effect of the pre-processing

Accuracy obtained after removing words with frequency equal to or lower than n

(a) Continuous scale (b) Non-continuous scale
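The pre-processing step studied here, dropping every word whose corpus frequency is at or below a threshold n, can be sketched as below. The function name `remove_infrequent` and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def remove_infrequent(docs, n):
    """Drop every word whose corpus frequency is equal to or lower than n."""
    freq = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if freq[w] > n] for doc in docs]

# Toy tokenised documents, invented for illustration only.
docs = [["hola", "che", "vos"], ["hola", "vale"], ["hola", "che"]]
filtered = remove_infrequent(docs, 1)
```

Here "vos" and "vale" occur once each, so they are removed with n = 1, while "hola" and "che" survive.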

The effect of the pre-processing

Number of words after removing those with frequency equal to or lower than n, and some examples of very infrequent words.

Evaluation results

Representation     Accuracy (%)
Skip-gram          72.2
LDR                71.1
SenVec             70.8
BOW                52.7
Char 4-grams       51.5
EmoGraph           39.3
tf-idf 2-grams     32.2
Random baseline    20.0

Error analysis

Confusion matrix of the 5-class classification

F1 values for identification as the corresponding language variety vs. others
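A one-vs-rest F1 value per class can be derived directly from a confusion matrix, as in the sketch below. The counts and class labels are toy values invented for the example, not the reported results, and `f1_per_class` is a hypothetical helper name.

```python
def f1_per_class(confusion, labels):
    """One-vs-rest F1 per class; rows are true classes, columns predictions."""
    scores = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted other
        fp = sum(row[i] for row in confusion) - tp  # predicted i, true class other
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Toy 3-class confusion matrix, invented for illustration only.
labels = ["A", "B", "C"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]
scores = f1_per_class(confusion, labels)
```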

Features contribution

Accuracy obtained with different combinations of features

Cost analysis

Complexity of obtaining the features:

l: number of varieties
n: number of terms of the document
m: number of terms in the document that coincide with some term in the vocabulary
n ≥ m and l << n

Number of features:

Representation     # Features
LDR                30
Skip-gram          300
SenVec             300
EmoGraph           1,100
BOW                10,000
Char 4-grams       10,000
tf-idf 2-grams     10,000

Robustness

Results obtained with the development set of the DSLCC corpus from the Discriminating between Similar Languages task (2015)

NOTE: Significant results in bold

LDR for age and gender identification

Dataset      Genre         Lang  Age: EmoGraph  Age: LDR  Pos.  Gender: EmoGraph  Gender: LDR  Pos.  No. of participants
PAN-AP-2013  Social Media  ES    66.24*         62.70     3     63.65*            60.75        6     21
PAN-AP-2014  Social Media  ES    45.9           38.16     6     68.6*             56.89        9     9
PAN-AP-2014  Social Media  EN    34.2*          31.63     6     53.4              51.42        9     10
PAN-AP-2014  Blogs         ES    46.4           46.43     3     64.3              50.00        5     9
PAN-AP-2014  Blogs         EN    46.2           38.46     3     71.3              67.95        1     10
PAN-AP-2014  Twitter       ES    58.9           56.67     2     73.3              63.33        2     8
PAN-AP-2014  Twitter       EN    45.5           52.60     1     72.1              67.53        3     9
PAN-AP-2014  Reviews       EN    30.8           32.28     5     66.1              67.11        5     10

Conclusions

LDR outperforms common state-of-the-art representations, with a 35% increase in accuracy.

LDR obtains competitive results compared with two distributed representation-based approaches that employ the popular continuous Skip-gram model.

LDR remains competitive with different languages and media (DSLCC).

The dimensionality is reduced from thousands of features to only 6 per language variety, which makes it feasible to deal with big data in social media.

We have applied LDR to age and gender identification, with results competitive with those of the well-performing EmoGraph.
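One reading of the 35% figure is a relative improvement of LDR over the best n-gram representation (BOW) in the evaluation results. That interpretation is an assumption, as is the count of five varieties, taken from the 5-class confusion matrix. The arithmetic can be checked as follows.

```python
def relative_increase(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

ldr_acc, bow_acc = 71.1, 52.7      # accuracies (%) from the evaluation results
gain = relative_increase(ldr_acc, bow_acc)

features_per_variety = 6           # stated in the conclusions
n_varieties = 5                    # assumed from the 5-class classification
n_features = features_per_variety * n_varieties

print(f"relative gain: {gain:.1f}%, LDR features: {n_features}")
```

Under these assumptions the gain comes to roughly 34.9%, and the feature count matches the 30 LDR features listed in the cost analysis.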

HispaTweets

https://github.com/autoritas/RD-Lab/tree/master/doc/projects/Identificacion-de-la-Variedad-del-Lenguaje-para-la-Mejora-del-Geoposicionamiento-en-Social-Media

- 650 authors per language variety
- 865 tweets per author (avg)
- 7 Spanish varieties:
  - Argentina
  - Chile
  - Colombia
  - Mexico
  - Peru
  - Spain
  - Venezuela

PAN-AP 2017

GENDER AND LANGUAGE VARIETY IDENTIFICATION

ENGLISH
● Australia
● Canada
● Great Britain
● Ireland
● New Zealand
● United States

SPANISH
● Argentina
● Chile
● Colombia
● Mexico
● Peru
● Spain
● Venezuela

PORTUGUESE
● Brazil
● Portugal

ARABIC
● Egypt
● Gulf
● Levantine
● Maghrebi

http://pan.webis.de/clef17/pan17-web/author-profiling.html

- 500 authors per gender and language variety
- 100 tweets per author

IberEval 2017


http://stel.ub.edu/Stance-IberEval2017/index.html

GENDER AND STANCE DETECTION WRT. INDEPENDENCE OF CATALONIA

● Tweets in Spanish and Catalan
● Annotated with
  ○ Gender (male / female)
  ○ Stance (favor / against / none)

Language: Catalan
Stance: FAVOR
Gender: FEMALE
Tweet: "15 diplomàtics internacional observen les plebiscitàries, serà que interessen a tothom menys a Espanya #27S"
(‘15 international diplomats observe the plebiscite, perhaps it is of interest to everybody except to Spain #27S’)
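An annotated tweet like the one above could be held in a small record type. The class name `AnnotatedTweet` and its field names are hypothetical and do not reflect the official distribution format of the task; only the label values come from the annotation scheme described here.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class AnnotatedTweet:
    """Hypothetical container for one tweet and its two annotations."""
    language: Literal["Spanish", "Catalan"]
    stance: Literal["FAVOR", "AGAINST", "NONE"]
    gender: Literal["MALE", "FEMALE"]
    text: str

example = AnnotatedTweet(
    language="Catalan",
    stance="FAVOR",
    gender="FEMALE",
    text="15 diplomàtics internacional observen les plebiscitàries, "
         "serà que interessen a tothom menys a Espanya #27S",
)
```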

Date post:	15-Apr-2017
Category:	Data & Analytics
Upload:	francisco-manuel-rangel-pardo