E. Fleury, ENS de Lyon / Inria https://team.inria.fr/dante/
CLARIN-PLUS workshop — Creation and Use of Social Media Resources, May 2017
https://sosweet.inria.fr
Outline
- Objectives of SoSweet
- Data collections
- Tools
- Future challenges
Objectives #1
- Provide a detailed understanding of the dynamic links between:
  - individuals,
  - social structure,
  - and language variation and change
- through the study of the synchronic variation and diachronic evolution of the variety of French observed on Twitter
Objectives #2
- Develop
  - interdisciplinary,
  - computational and data-driven approaches
- to handle the enormous amount of collected digital data specific to social media
Twitter as a lab to study how social structure shapes linguistic variability and change
Language on Twitter shows
- High variability
- High innovation rates
Twitter provides a large amount of
- Linguistic data
- Social data
Not forgetting bias
- Technological bias
- Communicative bias
Corpus linguistics, Sociolinguistics
Natural Language Processing, Network science
Linguistic data: tweets
Social data: followers network
Synchrony: linguistic variation / social structure
Diachrony: language change / social structure
Data collection: tweets
I live in GMT or GMT+1
I tweet in French
June 2014 to Dec 2019
Target: 500 million tweets
Now:
- 200 million
- annotated with parts of speech
- Gnip/Datasift/JSON
- Processed / TAG (CC Tagset)
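The two selection criteria on this slide (time zone and language) can be read as a filter over JSON records. The following is only a minimal sketch under stated assumptions: the field names `lang` and `utc_offset_hours` are hypothetical and do not reproduce the actual Gnip or Datasift schema, and the slide does not say that both criteria are applied together, which is assumed here.

```python
import json

FRENCH = "fr"
ALLOWED_UTC_OFFSETS_HOURS = {0, 1}  # GMT and GMT+1


def keep_tweet(record):
    """Return True if a record is in French and from GMT or GMT+1.

    Both field names are hypothetical placeholders for this sketch.
    """
    return (
        record.get("lang") == FRENCH
        and record.get("utc_offset_hours") in ALLOWED_UTC_OFFSETS_HOURS
    )


def filter_stream(lines):
    """Parse an iterable of JSON lines and yield the records that pass keep_tweet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_tweet(record):
            yield record


# Invented sample records, for illustration only.
sample = [
    '{"id": 1, "lang": "fr", "utc_offset_hours": 1, "text": "bonjour"}',
    '{"id": 2, "lang": "en", "utc_offset_hours": 0, "text": "hello"}',
    '{"id": 3, "lang": "fr", "utc_offset_hours": -5, "text": "salut"}',
]
kept_ids = [r["id"] for r in filter_stream(sample)]
print(kept_ids)
```

Only the first sample record satisfies both conditions; the second fails on language and the third on time zone.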
Data collection: networks
I have followers and followees
June 2014 to Dec 2019
Target: 10 million users
Now:
- 2 million
- authored tweets
- followees
- followers
Data collection: socio-economic data
I answer a fun quiz and some demographic questions
http://sosweet.ish-lyon.cnrs.fr
ALMAnaCH Linguistic workbench
- MELT: part-of-speech tagger
  - Coupling annotated corpus / morphosyntactic lexicon
- FRMG: French Meta Grammar
  - Wide-coverage abstract grammatical description for French
- Word embeddings with GloVe/FRMG
  - Global Vectors for Word Representation
  - syntactic links, not only co-occurrence
- https://gforge.inria.fr/projects/lingwb/
Word2Graph embedding
- Use GloVe or word2vec
  - embedding of the vocabulary into a "space"
- Build a k-closest proximity graph on the embedding
  - Two words that are close in the space are linked
  - Non-symmetric relation
- Run community detection on the proximity graph
  - https://gitlab.inria.fr
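The three steps above (embed, build a k-closest graph, detect communities) can be sketched in a few lines. This is an illustrative sketch, not the Word2Graph implementation: the 2-D vectors are invented toy data standing in for GloVe or word2vec output, cosine similarity is an assumed choice of proximity, and a deterministic label propagation is swapped in because the slide does not name a community detection algorithm.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(vectors, k):
    """Directed graph: each word points to its k most similar words.

    The relation is not symmetric in general: a can be among the
    k closest words of b without b being among the k closest of a.
    """
    graph = {}
    for word, vec in vectors.items():
        scored = [(cosine(vec, v), other)
                  for other, v in vectors.items() if other != word]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[word] = [other for _, other in scored[:k]]
    return graph


def label_propagation(graph, max_rounds=20):
    """Simple community detection on the symmetrised graph.

    Each node repeatedly adopts the most frequent label among its
    neighbours; ties are broken by the alphabetically smallest label
    so that the result is deterministic.
    """
    neighbours = {w: set(ns) for w, ns in graph.items()}
    for w, ns in graph.items():
        for other in ns:
            neighbours[other].add(w)
    labels = {w: w for w in neighbours}
    for _ in range(max_rounds):
        changed = False
        for w in sorted(neighbours):
            counts = Counter(labels[o] for o in neighbours[w])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.values())
            new = min(lab for lab, c in counts.items() if c == best)
            if new != labels[w]:
                labels[w] = new
                changed = True
        if not changed:
            break
    return labels


# Toy 2-D "embeddings" (invented values): animals versus weekdays.
vectors = {
    "chat": [1.0, 0.1], "chien": [0.9, 0.2], "lapin": [0.95, 0.05],
    "lundi": [0.1, 1.0], "mardi": [0.2, 0.9], "jeudi": [0.05, 0.95],
}
graph = knn_graph(vectors, k=2)
labels = label_propagation(graph)
print(labels)
```

On this toy input the animal words end up sharing one label and the weekday words another. With real embeddings the proximity graph is large and sparse, and a modularity-based method may be a better choice than label propagation.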
Open SoSweet to the community
- Ethical issues
  - Operational Legal and Ethical Risk Assessment Committee (COERLE)
- Licensing issues / Twitter restrictions
  - No distribution of full tweets
  - Distributing IDs is allowed
- Aggregation
- Sampling
- Online tools for queries
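The distribution options listed above (IDs only, sampling, aggregation) can each be expressed as a small transformation of the corpus. The sketch below is purely illustrative and assumes a hypothetical record layout with `id`, `created` (ISO date) and `text` fields; it is not the project's release pipeline.

```python
import random
from collections import Counter


def ids_only(tweets):
    """Keep only tweet IDs, dropping the text and every other field."""
    return [t["id"] for t in tweets]


def sample(items, fraction, seed=0):
    """Reproducible uniform sample of about `fraction` of the items."""
    rng = random.Random(seed)
    return rng.sample(items, round(len(items) * fraction))


def tweets_per_month(tweets):
    """Aggregate to monthly counts, which contain no individual tweet."""
    return Counter(t["created"][:7] for t in tweets)


# Invented records, for illustration only.
corpus = [
    {"id": 101, "created": "2016-03-02", "text": "exemple a"},
    {"id": 102, "created": "2016-03-15", "text": "exemple b"},
    {"id": 103, "created": "2016-04-01", "text": "exemple c"},
    {"id": 104, "created": "2016-04-20", "text": "exemple d"},
]
ids = ids_only(corpus)
subset = sample(ids, 0.5)
monthly = tweets_per_month(corpus)
print(ids, subset, dict(monthly))
```

Recipients of an ID list would then need to retrieve the tweets themselves, subject to Twitter's terms.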
Thanks
ICAR: Matthieu Quignard, Sandra Teston-Bonnard, Clément Thibert, Nathalie Rossy-Gensane, Daniel Valéro
Lidilem: Jean-Pierre Chevrot, Aurélie Nardy, Julie Peuvergne
Alpage: Marie Candito, Eric de la Clergerie, Benoît Crabbé, Benoît Sagot, Djamé Seddah
Dante: Eric Fleury, Marton Karsai, Hadrien Hours, Sa:i Jouaber, Jacob Levy-Abitbol, Tommaso Venturini