New tools for audio description research: the VIW project
Anna Matamala
Co-author: Marta Villegas
Universitat Autònoma de Barcelona
TransMedia Catalonia research group
anna.matamala@uab.cat
Languages and the Media, Berlin, 2-4 November 2016.
FFI2015-62522-ERC, 2014SGR0027, FFI2015-64038-P (MINECO/FEDER, UE)
Overview
• Why this project?
• Previous work on audio description (AD) and corpora
• Project rationale
• Creating the materials
• Processing the materials
• The platform
Why this project?
• Need for corpora to analyse audio description which are
  • Multimodal (audio, video, text)
  • Multilingual
  • Open access
• And allow for increasing research
Why this project?
• Funding for one year: Europa Excelencia call (FFI2015-62522-ERC)
• Main researcher: Anna Matamala
• Postdoc researcher: Marta Villegas
AD and corpora
• TIWO (Salway 2007)
• TRACCE (Jiménez Hurtado et al. 2010)
• MPII Movie Description dataset (Rohrbach et al. 2015)
• Pear Tree Project (Mazur and Kruger 2012), inspired by Chafe (1980)
• Reviers et al. (2015)
Challenges of multimodal corpora
• Knight (2011)
  • Design and infrastructure
  • Size and scope
  • Naturalness
  • Availability and (re)usability
• Valentini (2013)
  • Verbal, audio, and visual
  • Segmentation criteria
  • Need to devise a sound methodology
The short film
• Short film commissioned from a film director (guidelines based on a literature review)
• “What happens while---”, by Núria Nia, in English.
• Dubbed into Catalan and Spanish in a professional studio
http://pagines.uab.cat/viw/
The audio descriptions
• Audio descriptions by professionals (10 in English, 10 in Catalan, 10 in Spanish): recorded video plus text
• Additionally: audio descriptions by students (volunteers in Spanish and Catalan), only text
The corpus
AUDIO DESCRIPTION     VERSIONS    WORDS
ENGLISH                     10    6,799
CATALAN                     10    6,888
SPANISH                     10    6,191
STUDENTS-CATALAN             7    7,354
STUDENTS-SPANISH            10    5,185
GLOBAL                      47   32,417
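The global figures can be re-derived from the per-subcorpus rows. The following is a minimal Python sketch, not part of the project's code: the dictionary and variable names are illustrative, and the values are copied from the rows of the table. It also computes the mean length of a description in each subcorpus.

```python
# (versions, words) per subcorpus, copied from the rows of the table
corpus = {
    "English": (10, 6799),
    "Catalan": (10, 6888),
    "Spanish": (10, 6191),
    "Students-Catalan": (7, 7354),
    "Students-Spanish": (10, 5185),
}

# Global totals are plain sums over the rows
total_versions = sum(versions for versions, _ in corpus.values())
total_words = sum(words for _, words in corpus.values())

# Mean number of words per audio description in each subcorpus
mean_words = {name: words / versions for name, (versions, words) in corpus.items()}

print(total_versions, total_words)
for name, mean in mean_words.items():
    print(f"{name}: {mean:.1f} words per version")
```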
The corpus
http://pagines.uab.cat/viw/
Linked to UAB's open access repository
Processing the materials
[Diagram: processing pipeline. Labels: mp4; txt; linguistically annotated text (three boxes); conll2eaf; eaf file; web app]
Segmenting and processing
• Linguistic tiers: AD-unit tier (sentences, chunks, tokens) and Credits tier
• Token: parts of speech, lemma, and semantic values
• Filmic tiers: scene, shot, sound, character, text.
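One way to picture these tiers is as time-stamped records: AD units that carry annotated tokens, and filmic annotations that can be matched against them by time overlap. The Python sketch below is only an illustration under that assumption. All class and field names are hypothetical, the sample values are invented, and the sentence and chunk levels are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    form: str
    pos: str            # part of speech
    lemma: str
    semantic: str = ""  # semantic value, if one was assigned


@dataclass
class ADUnit:
    start_ms: int
    end_ms: int
    tokens: List[Token] = field(default_factory=list)

    def text(self) -> str:
        return " ".join(t.form for t in self.tokens)


@dataclass
class FilmicAnnotation:
    tier: str           # scene, shot, sound, character or text
    start_ms: int
    end_ms: int
    value: str


def overlapping(unit: ADUnit, annotations: List[FilmicAnnotation]) -> List[FilmicAnnotation]:
    """Return the filmic annotations whose time span overlaps the AD unit."""
    return [a for a in annotations
            if a.start_ms < unit.end_ms and unit.start_ms < a.end_ms]


# Invented sample data, for illustration only
unit = ADUnit(1000, 3000, [Token("She", "PRON", "she"),
                           Token("smiles", "VERB", "smile", "emotion")])
filmic = [
    FilmicAnnotation("scene", 0, 5000, "kitchen"),
    FilmicAnnotation("shot", 3000, 4000, "close-up"),
    FilmicAnnotation("character", 500, 2000, "woman"),
]
hits = overlapping(unit, filmic)
print(unit.text(), [a.tier for a in hits])
```

Keeping the linguistic and filmic layers as separate records joined only by time mirrors the way tiers share a timeline in an annotation tool, so either layer can be queried on its own.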
Timeline
[Diagram. Labels: ShortMovie.mp4; AudioDesc.txt (en1, en2, en3, en4…; es1, es2, es3, es4…; ca1, ca2, ca3, ca4…); filmic annotations; linguistic annotations (EN, CA, ES)]
The web app
• Web app using Symfony and a hosting service at UAB
• All code and data are available on GitHub
• Access to source data plus some graphical visualizations
The web app: source data
• Raw material per provider and per subcorpus to import into ELAN and into CQPweb. Also filmic annotations as an eaf file.
• Visualizations for pre-established analyses.
• Access from the previous page but also directly: http://transmediacatalonia.uab.cat/web/
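Because eaf files are XML, the downloaded material can also be read outside ELAN. The sketch below uses only Python's standard library and assumes a file with time-aligned annotations. The embedded sample document, its tier name, and its annotation values are invented for illustration and are far smaller than a real export.

```python
import xml.etree.ElementTree as ET

# Invented, heavily reduced eaf document used only to exercise the function
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="4000"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="6200"/>
  </TIME_ORDER>
  <TIER TIER_ID="AD-unit" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>A woman walks in.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>She sits down.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_xml):
    """Map each tier ID to a list of (start_ms, end_ms, value) tuples."""
    root = ET.fromstring(eaf_xml)
    # Time slots hold the millisecond values that annotations refer to
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolved times
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


tiers = read_tiers(SAMPLE_EAF)
for start, end, value in tiers["AD-unit"]:
    print(start, end, value)
```

For a file on disk, the same function could be fed the file's contents read as text. Annotations that depend on other annotations instead of time slots are ignored in this sketch.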
Data and visualisations
• Simple string search
• Counts of AD units, sentences and words
• AD distribution in the timeline
• Verb distribution in the timeline, with selection of verbal semantic class.
• AD similarity (Ted Pedersen's Text-Similarity module)
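To give a rough idea of what a similarity score between two audio descriptions measures, the Python sketch below compares word sets with the Dice coefficient. It is a simple stand-in and not the Text-Similarity module, which is a separate Perl package with its own measures. The function names and the sample sentences are invented for illustration.

```python
import re
from itertools import combinations


def word_set(text):
    """Lower-cased set of word tokens (accented letters are kept)."""
    return set(re.findall(r"\w+", text.lower()))


def dice(a, b):
    """Dice coefficient over word sets: 2|A ∩ B| / (|A| + |B|)."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(wa & wb) / (len(wa) + len(wb))


# Invented sample descriptions, keyed by provider
descriptions = {
    "en1": "A woman opens the door.",
    "en2": "The woman opens a door slowly.",
    "en3": "Rain falls on the street.",
}

# Pairwise similarity between all providers
scores = {
    (p, q): dice(descriptions[p], descriptions[q])
    for p, q in combinations(descriptions, 2)
}
for pair, score in scores.items():
    print(pair, round(score, 3))
```

A set-based measure ignores word order and repetition, so it captures only shared vocabulary between providers.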
Data and visualisations
• Word frequency by PoS, per provider.
• Semantic tagging for verbs, nouns, adjectives, and adverbs.
• HTML version of each eaf file, with access to visuals.
…and many other features
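A frequency count by part of speech, per provider, can be sketched in a few lines of Python. The input here is assumed to be tab-separated text with one token per line in the order form, lemma, PoS. That column layout, the tag names, and the sample tokens are assumptions made for the example and may not match the project's annotated files.

```python
from collections import Counter, defaultdict

# Invented sample: one tab-separated token per line (form, lemma, PoS)
SAMPLE = {
    "en1": "A\ta\tDET\nwoman\twoman\tNOUN\nopens\topen\tVERB\nthe\tthe\tDET\ndoor\tdoor\tNOUN\n",
    "en2": "She\tshe\tPRON\nopens\topen\tVERB\nit\tit\tPRON\n",
}


def lemma_freq_by_pos(annotated_text):
    """Return {PoS: Counter of lemmas} for one provider's annotated text."""
    freq = defaultdict(Counter)
    for line in annotated_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        form, lemma, pos = line.split("\t")[:3]
        freq[pos][lemma] += 1
    return freq


per_provider = {name: lemma_freq_by_pos(text) for name, text in SAMPLE.items()}

for name, freq in per_provider.items():
    for pos, counts in freq.items():
        print(name, pos, counts.most_common(3))
```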
And the future?
• On-going analysis on
  • Character
  • Text on screen
  • Spatio-temporal settings
  • Professionals versus amateurs
• How to expand it into other languages?