Text versus Speech: A Comparison of Tagging Input Modalities for Camera Phones


Research & Development

Mauro Cherubini, Xavier Anguera, Nuria Oliver, and Rodrigo de Oliveira

people do not want to tag their pictures

intro → hypotheses → methodology → results → implications

research question:

Assuming that users are willing to input at least one tag, which input modality best supports producing the tags and retrieving the pictures?


hypothesis 1

Speech is preferred to text as an annotation mechanism on mobile phones (objective measure)

Support: - Mitchard and Winkles (2002)


hypothesis 1-bis

Speech annotations are preferred by users even if this means spending more time on the task (subjective measure)

Support: - Perakakis and Potamianos (2008)


hypothesis 2

The longer the tag, the larger the advantage of voice over text for annotating pictures on mobile phones

Support: - Hauptmann and Rudnicky (1990)


hypothesis 3

Retrieving pictures on mobile phones with speech is not faster than with text (objective measure)

Support: - Mills et al. (2000)


the user study


field study (4 weeks)

controlled experiment

T1 - T2 - T3 - T4

3 experimental conditions:

a. Speech only

b. Text only

c. Speech and Text


MAMI


features of MAMI

•  processing is done entirely on the mobile phone

•  speech is not transcribed

•  to compare the waveforms of the audio tags, MAMI uses the Dynamic Time Warping (DTW) algorithm
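
The matching idea behind the last bullet can be sketched as follows. This is a minimal illustration of classic Dynamic Time Warping on toy one-dimensional sequences, not MAMI's actual implementation: the slides do not say which audio features MAMI extracts or how it ranks matches, so the function names `dtw_distance` and `best_match`, the absolute-difference local cost, and the sample numbers are all assumptions made for the example.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two numeric sequences (assumed 1-D features)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays (stretch b)
                cost[i][j - 1],      # b advances, a stays (stretch a)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def best_match(query, stored_tags):
    """Return the name of the stored tag whose sequence is closest to the query."""
    return min(stored_tags, key=lambda name: dtw_distance(query, stored_tags[name]))


# Hypothetical stored audio tags, represented here as short toy sequences.
stored = {"garden": [1, 2, 2, 3], "stairs": [5, 5, 5]}
match = best_match([1, 2, 3], stored)
```

Because DTW allows one sequence to be stretched against the other, the same word spoken more slowly can still align closely with its stored tag without any transcription step.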

task 1: remember the tag


stimulus retrieval

Pictures taken during the field trial

task 2: remember the context


stimulus retrieval

TASK 2 PICTURE 1

three little bushes Garden Tree Stairs

task 3: remember the picture


stimulus retrieval

Audio tags were converted into textual tags, and vice versa

task 4: remember the sequence


assignment retrieval

TASK 4

Three pictures among the oldest and three pictures among the newest.

metrics


•  time to completion

•  false positives

•  retrieval errors

results H1


results H1-bis

All participants in the Speech and Text (BOTH) group felt that tagging with text was more effective than tagging with voice.

Voice: 3.33 [0.81], Text: 4.34 [0.81] (Mean [SD]; 1 = completely agree, 5 = completely disagree)


results H2


results H3


results H3 - continued

take away 1: speech is not a given

the advantage of audio as an input modality for tagging pictures on mobile phones is not a given

why?

1. retrieval precision

2. privacy


take away 2: input mistakes

text input mistakes are addressed immediately; mistakes in audio recordings, by contrast, are addressed less frequently


take away 3: memory

speech does not help users memorize the tags


implication 1: allow multiple modalities

© Pixar, 2008


implication 2: enable audio inspection


implication 3: enable modality synesthesia

© Disney, 1940


end: thanks


http://www.i-cherubini.it/mauro/blog/ http://research.tid.es/multimedia/


Date post:	16-May-2015
Upload:	mauro-cherubini