A Semi-automatic Annotation Tool For Cooking Video

Simone Bianco (a), Gianluigi Ciocca (a), Paolo Napoletano (a), Raimondo Schettini (a), Roberto Margherita (b), Gianluca Marini (c), Giorgio Gianforme (c), Giuseppe Pantaleo (c)

(a) DISCo (Dipartimento di Informatica, Sistemistica e Comunicazione), Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy;

(b) Almaviva S.p.a., (c) Almawave S.r.l., Centro Direzionale Business Park, Via dei Missaglia n. 97, Edificio B4, 20142 Milano, Italy.

ABSTRACT

In order to create a cooking assistant application that guides users in the preparation of dishes relevant to their profile diets and food preferences, it is necessary to accurately annotate the video recipes, identifying and tracking the foods handled by the cook. These videos present particular annotation challenges such as frequent occlusions and food appearance changes. Manually annotating the videos is a time-consuming, tedious and error-prone task. Fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest are not error free, and false positive and false negative detections need to be corrected in a post-processing stage. We present an interactive, semi-automatic tool for the annotation of cooking videos that integrates computer vision techniques under the supervision of the user. The annotation accuracy is increased with respect to completely automatic tools, and the human effort is reduced with respect to completely manual ones. The performance and usability of the proposed tool are evaluated on the basis of the time and effort required to annotate the same video sequences.

Keywords: Video annotation, object recognition, interactive tracking

1. INTRODUCTION

The annotation of image and video data in large datasets is a fundamental task in multimedia information retrieval [1–3] and computer vision applications [4–9].

The manual generation of video annotations by a user is a time-consuming, tedious and error-prone task: typical videos are recorded at a frame rate of 24-30 frames per second, so even a short video of 60 seconds would require the annotation of 1440-1800 frames. Ideally, fully automatic tools that integrate computer vision algorithms to extract and identify the elements of interest across frames should be used. Unfortunately, state-of-the-art algorithms such as image and video segmentation, object detection and recognition, object tracking, and motion detection [10–12] are not error free, and false positive and false negative detections would require a human effort to correct them in a post-processing stage. As a consequence, several efficient semi-automatic visual tools have been developed [13–17]. Such tools, which support the annotator with basic computer vision algorithms (e.g., key frame detection, motion and shape linear interpolation), have proven to be very effective in terms of the number of user interactions, user experience, usability, accuracy and annotation time [15]. The most recent trend is the development of tools that integrate computer vision algorithms (such as unsupervised/supervised object detection, object tracking, etc.) that assist humans or cooperate with them to accomplish labelling tasks [18, 19].

In this paper we present a tool for interactive, semi-automatic video annotation that integrates customized versions of well-known computer vision algorithms, specifically adapted to work in an interactive framework. The tool has been developed and tested within the Feed for Good project, described later in Sec. 2, to annotate video recipes, but it can be easily adapted to annotate videos from different domains as well.

Simone Bianco: [email protected], Gianluigi Ciocca: [email protected], Paolo Napoletano: [email protected], Raimondo Schettini: [email protected], Roberto Margherita: [email protected], Gianluca Marini: [email protected], Giorgio Gianforme: [email protected], Giuseppe Pantaleo: [email protected]

The integration of computer vision techniques, under the supervision of the user, makes it possible to increase the annotation accuracy with respect to completely automatic tools while at the same time reducing the human effort with respect to completely manual ones. Our tool includes different computer vision modules for object detection and tracking within an incremental learning framework. The object detection modules aim at localizing and identifying the occurrences of pre-defined objects of interest: for a given frame, the output of an object detector is a set of bounding boxes and the detected object identities. The object tracking modules aim at propagating the identities of detected objects across the video sequence: the objects identified in previous frames are used as inputs, and associations with the localized objects are given as outputs. The output of the tracking modules can also be used as feedback to the object detection modules.
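
To make this data flow concrete, the following minimal C++ sketch shows the kind of per-frame record that such detection and tracking modules could exchange. The struct and field names are illustrative assumptions, not the tool's actual interfaces.

    // Hypothetical per-frame annotation record exchanged between the detection
    // and tracking modules; names are illustrative, not the tool's source code.
    #include <opencv2/core.hpp>
    #include <string>
    #include <vector>

    struct DetectedItem {
        std::string identity; // e.g. "Oil 01"
        cv::Rect    bbox;     // bounding box in frame coordinates
        bool        manual;   // true if the box was placed or edited by the user
    };

    // All annotations of one frame: the tracker consumes the items of frame t-1,
    // proposes associations for frame t, and its output can be fed back to the
    // detector as a prior for the next pass.
    using FrameAnnotations = std::vector<DetectedItem>;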

The annotation tool also provides an interactive framework that allows the user to: browse the annotation results using an intuitive graphical interface; correct false positive and false negative errors of the computer vision modules; add new instances of objects to be recognized.

The paper is structured as follows: Section 2 describes the context within which the tool has been developed, illustrating the challenges in annotating our cooking videos. Section 3 describes the design of the tool, its functionalities and the user interactions. The system's usability is assessed by different users and the results are shown in Section 4. Finally, Section 5 concludes the paper.

2. PROBLEM DEFINITION

The tool has been realized in the context of the Feed for Good project, which aims at promoting food awareness. Among its objectives there is the creation of a cooking assistant application to guide users in the preparation of the dishes relevant to their profile diets and food preferences, illustrating the actions of the cook and showing, on request, the nutrition properties of the foods involved in the recipe. To this end it is necessary to accurately annotate the video recipes with the steps of food processing, the identities and locations of the processed foods, and the cooking activities.

The cooking videos have been acquired in a professional kitchen with a stainless steel worktop. The videos have been recorded by professional operators using three cameras: one central camera, which recorded the whole scene with wide shots, and two side cameras for mid shots, medium close-ups, close-ups, and cut-ins. A schematic representation of the acquisition setup is drawn in Fig. 1.

Figure 1. Disposition of the digital cameras with respect to the kitchen worktop.

The video recipes are HD quality videos with a vertical resolution of 720 pixels (1280×720) at 24 frames per second, compressed in MPEG4. The videos were acquired with the aim of being aesthetically pleasing and useful for the final user. The recorded videos were then edited to obtain the final videos: the edited videos are a sequence of shots suitably chosen from those captured by the three cameras in order to clearly illustrate the steps in the recipe. Figure 2 shows a visual summary of the “Tegame di Verdure” recipe (the summary has been extracted using the algorithm in [20]).
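
As a hedged illustration of how such videos can be loaded for annotation, the following OpenCV snippet (file name hypothetical; the CAP_PROP constants assume OpenCV 3 or later) opens an edited recipe video and prints its resolution and frame rate, which should match the 1280×720, 24 fps format described above.

    // Minimal sketch: open an edited recipe video and print its properties.
    #include <opencv2/videoio.hpp>
    #include <iostream>

    int main() {
        cv::VideoCapture cap("tegame_di_verdure.mp4"); // hypothetical file name
        if (!cap.isOpened()) return 1;
        std::cout << cap.get(cv::CAP_PROP_FRAME_WIDTH)  << "x"
                  << cap.get(cv::CAP_PROP_FRAME_HEIGHT) << " @ "
                  << cap.get(cv::CAP_PROP_FPS)          << " fps\n";
        return 0;
    }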

Figure 2. Visual summary of the video sequence “Tegame di Verdure”.

With respect to other domains, our cooking domain presents particular challenges such as frequent occlusions and food appearance changes. An example showing a typical case where a cucumber is being chopped is reported in Figure 3.

3. TOOL DESCRIPTION

The proposed tool has been developed using C/C++, the Qt libraries [21] for the GUI and the Open Computer Vision libraries (OpenCV) [22] for the computer vision algorithms. The system handles a video annotation session as a project. A video is divided into shots that are automatically detected either from an Edit Decision List (EDL) file provided as input (see Figure 4) or by a shot detection algorithm. Each annotation session must be associated with a list of items provided as a text file during the project creation procedure.

Figure 3. How food changes appearance during cooking. In this sequence a cucumber is being finely chopped.

TITLE: R_sogliola mugnaia_burro_EDIT

001  AX  V  C  00:01:24:22 00:01:29:01 00:00:00:00 00:00:04:04
* FROM CLIP NAME: R_sogliola mugnaia_burro_S_CAM01

002  AX  V  C  00:01:29:01 00:01:36:02 00:00:04:04 00:00:11:05
* FROM CLIP NAME: R_sogliola mugnaia_burro_S_CAM03

003  AX  V  C  00:01:36:02 00:01:41:08 00:00:11:05 00:00:16:11
* FROM CLIP NAME: R_sogliola mugnaia_burro_S_CAM02

004  AX  V  C  00:01:41:08 00:01:56:06 00:00:16:11 00:00:31:09
* FROM CLIP NAME: R_sogliola mugnaia_burro_S_CAM01

005  AX  V  C  00:01:56:06 00:02:04:04 00:00:31:09 00:00:39:07
* FROM CLIP NAME: R_sogliola mugnaia_burro_S_CAM03

Figure 4. Excerpt from an EDL file. The file describes from which source video each shot has been taken, together with its original and edited positions.
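
As a rough sketch of how shot boundaries could be recovered from such a file, the following C++ fragment parses the event lines of the CMX3600-style excerpt above into record-side in/out timecodes and clip names. It is an illustrative assumption, not the tool's actual EDL parser.

    // Hedged sketch of an EDL parser: one Shot per numbered event line,
    // clip name taken from the following "* FROM CLIP NAME:" comment.
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Shot {
        int         index;
        std::string clipName;
        std::string recordIn;  // timecode in the edited video, e.g. 00:00:00:00
        std::string recordOut; // e.g. 00:00:04:04
    };

    std::vector<Shot> parseEdl(const std::string& path) {
        std::vector<Shot> shots;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream ss(line);
            int idx;
            std::string reel, track, cut, srcIn, srcOut, recIn, recOut;
            if (ss >> idx >> reel >> track >> cut >> srcIn >> srcOut >> recIn >> recOut) {
                shots.push_back({idx, "", recIn, recOut});
            } else if (line.rfind("* FROM CLIP NAME:", 0) == 0 && !shots.empty()) {
                shots.back().clipName = line.substr(18); // text after the label (assumes a name follows)
            }
        }
        return shots;
    }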

Items can be grouped into categories; for example, in the case of the cooking domain we have food and kitchenware categories (see Table 1). An annotated item is enclosed by a colored (dashed or solid) rectangle, namely a bounding box (bbox). Different colors represent different object categories: for instance, green stands for food and yellow for kitchenware (see Fig. 5). Solid rectangles stand for annotations that have been manually obtained, while dashed rectangles stand for annotations obtained by an automatic algorithm.
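
The color convention can be illustrated with a short OpenCV drawing helper; this is an assumed sketch rather than the tool's rendering code, and since OpenCV has no dashed-rectangle primitive, the manual/automatic distinction is rendered here by line thickness instead of line style.

    // Illustrative sketch of the bounding box convention: green for food,
    // yellow for kitchenware (BGR order); thickness stands in for solid/dashed.
    #include <opencv2/imgproc.hpp>

    void drawAnnotation(cv::Mat& frame, const cv::Rect& bbox,
                        bool isFood, bool isManual) {
        const cv::Scalar color = isFood ? cv::Scalar(0, 255, 0)    // green: food
                                        : cv::Scalar(0, 255, 255); // yellow: kitchenware
        const int thickness = isManual ? 3 : 1; // manual boxes drawn heavier
        cv::rectangle(frame, bbox, color, thickness);
    }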

3.1 User interface

The graphical user interface (GUI) of the proposed tool is presented in Fig. 5. The menu bar on the top allows the user to handle a project: open, create, save and close operations. Not considering the menu bar (in the top part) and the status bar (in the bottom part), the GUI can be divided into two parts.

The upper part contains video related information: the list of shots, the list of items and, more importantly, a video browser which allows the user to seek through frames and sequentially browse shots. The list of shots is located on the left side and contains clickable items, allowing the user to browse the shots. On the right side we have the list of items to annotate (List) and the list of already annotated items in the sequence (Annotated). Each list can be accessed by browsing each category of items. For instance, if we want to annotate a sample of Food, we have to choose List → Food → Oil. The new Oil item will be named by adding a unique identification number (e.g. 01) to the object name. In this way every object has a meaningful, unique name as identifier (e.g. Oil 01). If we want to modify an annotated Oil with identifier 01, we have to choose Annotated → Food → Oil 01 from the Annotated list of items.

The lower part of the GUI contains the time-line of the annotated items. Each line reports how the state of a given annotated object changes along the frames: locked or unlocked existing, and locked or unlocked not existing (see Fig. 9). The meaning of these states will be clarified later.

id  Category     Item
1   Food         Spinach
2   Food         Basil
3   Food         Salt
4   Food         Oil
5   Kitchenware  Plastic wrap
6   Kitchenware  Pan

Table 1. Example of a list of items.

The status bar contains a time data viewer showing the current shot, frame and timecode (e.g. SHOT 1, FRAME 162, TIMECODE 00:00:06:12) and a viewer of the linear interpolation status (e.g. LINEAR INTERPOLATION: ENABLED).
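
A generic helper like the following (an assumption, not the tool's code) shows how a frame index can be converted into the HH:MM:SS:FF timecode displayed in the status bar, for an integer frame rate.

    // Convert a zero-based frame index into an HH:MM:SS:FF timecode string.
    #include <cstdio>
    #include <string>

    std::string frameToTimecode(long frame, int fps) {
        long totalSeconds = frame / fps;
        int ff = static_cast<int>(frame % fps);
        int ss = static_cast<int>(totalSeconds % 60);
        int mm = static_cast<int>((totalSeconds / 60) % 60);
        int hh = static_cast<int>(totalSeconds / 3600);
        char buf[16];
        std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d:%02d", hh, mm, ss, ff);
        return std::string(buf);
    }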

Figure 5. Interactive Video Annotation Tool GUI.

3.2 User-Tool interaction

The user can interact with the tool through clickable buttons, drag & drop operations, context menus and short-cuts. The interface includes standard video player control buttons (play/stop and a time slider) and shot browsing buttons (next, previous). Drag and drop operations are allowed only on the lists of items in the right part of the GUI. The tool provides three different context menus: one can be activated on the bounding box of an item, another on the time-line of an item and on each box (time step) of each line, and the last on an item from the list.

All the operations achievable through buttons and context menus can also be performed by selecting the appropriate area and then using short-cuts. For instance, the short-cuts of the video player are: play, backward, forward, prev. shot, next shot. In Fig. 6 we show a concept of a customized controller specifically designed to interact with this tool.

Figure 6. GUI of a software application specifically designed for tablet devices. On the bottom right, a multi-touch mouse/track-pad; on the top right and left sides, short-cut buttons.

Figure 7. Customized keyboard with shortcuts.

Such a controller can be a software application for tablet devices or a hardware device. In Fig. 7 we also show how the short-cuts map onto a regular keyboard.

3.2.1 Manual annotations

The user annotates a new item by first choosing it from the list on the right side of the GUI, and then by dragging and dropping it onto the video frame. Once the item has been dropped on the image, the user can draw a rectangular box around the object. Users can also re-annotate an existing item; in this case it must be chosen from the list of annotated items. The size and position of the bounding box can be changed manually by modifying the rectangular shape around the object. Each time the user modifies a bounding box, a forward and backward linear interpolation algorithm is triggered, unless it has been explicitly disabled.
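
A minimal sketch of the interpolation step, assuming two user-edited keyframes A and B with frameA < frame < frameB, is the following; the tool's actual implementation may differ in details such as rounding.

    // Linearly interpolate a bounding box between two keyframe annotations.
    #include <opencv2/core.hpp>

    cv::Rect interpolateBox(const cv::Rect& a, const cv::Rect& b,
                            int frameA, int frameB, int frame) {
        const double t = static_cast<double>(frame - frameA) / (frameB - frameA);
        return cv::Rect(cvRound(a.x      + t * (b.x      - a.x)),
                        cvRound(a.y      + t * (b.y      - a.y)),
                        cvRound(a.width  + t * (b.width  - a.width)),
                        cvRound(a.height + t * (b.height - a.height)));
    }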

Several options to delete an item can be activated by using context menus or short-cuts: deletion of an item in a given frame, in all shots, in a given range of frames, and from a given frame until the end of the shot (short-cuts: delete, delete line, delete range and delete from, respectively; see Fig. 6 and Fig. 7). During the annotation process bounding boxes can be hidden to prevent overlapping objects from being confused (short-cut: hide/show).

3.2.2 Automatic annotations

Automatic annotations can be provided by several algorithms embedded in the system: linear interpolation, template-based tracking and supervised object detection.

Figure 8. Finite state machine describing the interactive annotation of an item.

As already discussed in the previous section, the linear interpolation is automatically triggered each time a user modifies a bounding box; it can be activated/deactivated by using the context menus or the short-cut linear interp. In the case of template-based tracking, the user can trigger it by first creating a new annotated item, or by selecting an existing bounding box, and then by selecting in the context menu the option object detection → unsupervised (alternatively, by using the related short-cut: unsup. obj. det.).
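
The core of a template-based tracking step can be sketched with OpenCV's normalized cross-correlation, as below; this is a simplified stand-in for the tool's tracker, which searches the current frame for the region most similar to the template cropped from the previously annotated bounding box.

    // Hedged sketch of one template-tracking step using cv::matchTemplate.
    #include <opencv2/imgproc.hpp>

    cv::Rect trackByTemplate(const cv::Mat& frame, const cv::Mat& templ) {
        cv::Mat score;
        cv::matchTemplate(frame, templ, score, cv::TM_CCOEFF_NORMED);
        cv::Point best;
        cv::minMaxLoc(score, nullptr, nullptr, nullptr, &best); // best = location of maximum score
        return cv::Rect(best, templ.size());
    }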

The supervised object detection can be activated by selecting an item from the list on the right side of the GUI and then by selecting in the context menu the option object detection → supervised (alternatively, by using the related short-cut: superv. obj. det.). This class of algorithms needs a learned template to work; therefore, if such a model is not available for that object, the supervised object detection option is disabled. For this reason, the tool allows users to crop object templates to be used later for training a supervised object detector. The tool allows the user to crop a template from a visible item in a given frame, or several templates from an item in all the time steps in which it is visible (short-cuts: insert templ. and insert all templ., respectively).
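
Cropping a template amounts to extracting and storing the image region under the item's bounding box; a possible sketch, with a purely hypothetical file naming scheme, is shown below.

    // Save the region under a bounding box as a training template (sketch).
    #include <opencv2/imgcodecs.hpp>
    #include <string>

    void saveTemplate(const cv::Mat& frame, const cv::Rect& bbox,
                      const std::string& itemName, int frameIdx) {
        cv::Mat templ = frame(bbox).clone(); // deep copy of the region of interest
        cv::imwrite(itemName + "_" + std::to_string(frameIdx) + ".png", templ);
    }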

3.2.3 Interaction between manual and automatic annotations

The annotation of an item can be provided manually or automatically. To handle the interaction between the annotations provided by the user and those provided by the algorithms, we have introduced the concept of locked and unlocked objects. This concept is related to a specific object at time instant t. If the annotation at time t has been provided or modified by the user, then the state of the annotated item, independently of its presence at time t, is locked. On the contrary, if the annotation at time t has been provided or modified by an algorithm, then the state of the annotated item, independently of its presence at time t, is unlocked. Only the user can change the state of an annotated item from locked to unlocked and vice versa (see Fig. 6 and Fig. 7 for the related short-cuts).

In Fig. 8 we show a finite state machine describing all the possible interactions between manual and automatic changes of the annotations. In Fig. 9 we show how the different states of the annotations are visualized in the timeline. For each time instant, and independently of the presence of the item, small shadow boxes located in the bottom part of the box at time t indicate that the current state is “locked”, while the absence of such small boxes indicates that the current state is “unlocked”.
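
The interplay between presence and lock state can be summarized with a small sketch such as the following; the type and function names are assumptions, but the rule they encode follows the description above: a manual edit always locks the state, and automatic algorithms may only overwrite unlocked states.

    // Sketch of the per-frame item state and of the locking rule.
    enum class Presence { Existing, NotExisting };
    enum class Lock     { Locked, Unlocked };

    struct ItemState {
        Presence presence = Presence::NotExisting;
        Lock     lock     = Lock::Unlocked;
    };

    void applyManualEdit(ItemState& s, Presence p) {
        s = {p, Lock::Locked};                    // user edits always lock the state
    }

    bool applyAutomaticEdit(ItemState& s, Presence p) {
        if (s.lock == Lock::Locked) return false; // user annotations take precedence
        s.presence = p;                           // state remains unlocked
        return true;
    }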

4. SYSTEM EVALUATION

Different users tested our tool for several days, annotating different video recipes. A questionnaire in two parts was administered to these users in order to collect their impressions about the usability of the tool and its functionalities.

Figure 9. Time-line of an annotated item. Each bar displays the item state within a given frame. Grey bars correspond to the absence of the item (e.g. out of frame, or occluded). Light-green bars indicate that the item is visible. Dark-green half bars represent a locked state, that is, a state that can be modified only by manual intervention. (Please refer to the on-line version of the paper for color references.)

#   Statement
1.  I think that I would like to use this system frequently
2.  I found the system unnecessarily complex
3.  I thought the system was easy to use
4.  I would need the support of a technical person to be able to use this system
5.  I found the various functions in this system were well integrated
6.  I thought there was too much inconsistency in this system
7.  I would imagine that most people would learn to use the system very quickly
8.  I found the system very cumbersome to use
9.  I felt very confident using the system
10. I needed to learn a lot of things before I could get going with this system
11. The timeline is not very useful
12. Too many input/interactions are required to obtain acceptable results
13. The keyboard short-cuts are useful
14. It is difficult to keep track of the already annotated items
15. The drag’n’drop mechanism to annotate new items is too slow
16. I think there is too much information displayed in too many panels
17. The system user interface is easy to understand
18. Semi-automatic annotation algorithms are too slow
19. I prefer using only manual/basic annotation functionalities
20. I needed to correct many errors made by the semi-automatic annotation algorithms

Table 2. The usability questionnaire administered to the iVAT users. The numerical scale goes from ‘strongly disagree’ to ‘strongly agree’ (1 → ‘strongly disagree’, 2 → ‘disagree’, 3 → ‘neutral’, 4 → ‘agree’, 5 → ‘strongly agree’).

The first part was inspired by the System Usability Scale (SUS) questionnaire developed by John Brooke at DEC (Digital Equipment Corporation) [23]. It is composed of statements related to different aspects of the experience, and the subjects were asked to express their agreement or disagreement with a score taken from a Likert scale of five numerical values: 1 expressing strong disagreement with the statement, 5 expressing strong agreement, and 3 expressing a neutral answer. The second part of the questionnaire focuses more on the functionalities of the annotation tool and was administered in the same way.

The results of the questionnaire are reported in Table 2. The scores given by the users are summarized by taking the median of all the votes. With respect to usability, the tool has been rated positively, with very similar votes for all of the first ten statements; on average the overall system has been evaluated as easy to use. With respect to the tool functionalities, the votes are more diverse among the ten statements, even though the functionalities are judged positively. From the users' responses, it can be seen that the interactive mechanism can efficiently support the annotation of the videos. The semi-automatic algorithms, although not very precise in the annotation, can give a boost in annotation time and require only a few corrections to obtain the desired results. The best rated functionalities of the tool are the keyboard short-cuts and the graphical user interface. The short-cuts allow the users to interact with the system with mouse and keyboard simultaneously, increasing the annotation efficiency. The graphical interface is easy to understand and keeps track of the annotated items with clear visual hints. While designing the interface we were worried that the amount of displayed information would be too much; an interview with the users proved the contrary. However, as suggested by the users, the time-line panel should be further improved. Although it is intuitive and easy to understand, the users considered it useful to add the capability to zoom in and out of the time-line: when the number of frames is very large, they found it tedious to scroll the panel back and forth in order to seek the desired frame interval. A resizeable time-line that could be made to fit the size of the panel would be a welcome addition to the tool.

With respect to the tool's usage, initially some users preferred to start the annotation process by using only the manual functionalities coupled with the interpolation, and then adjust the results. Other users first exploited the automatic or semi-automatic annotation functionalities and then manually modified the results. After having acquired familiarity with the tool, all the users started to mix the manual and semi-automatic annotation functionalities. One usage pattern common to all the users is that they carry out the annotation process following the temporal order of the frames, that is, they start from the beginning of the video sequence and move forward.

5. CONCLUSIONS

In this paper we presented an interactive, semi-automatic tool for the annotation of cooking videos. The tool includes different computer vision modules for object detection and object tracking within an incremental learning framework. The integration of computer vision techniques, under the supervision of the user, makes it possible to increase the annotation accuracy with respect to completely automatic tools while at the same time reducing the human effort with respect to completely manual ones.

The annotation tool provides an interactive framework that allows the user to: browse the annotation results using an intuitive graphical interface; correct false positive and false negative errors of the computer vision modules; add new instances of objects to be recognized.

A questionnaire was administered to the users of our tool in order to collect their impressions about its usability and functionalities. The users rated the tool positively, and on average the overall system has been evaluated as easy to use. With respect to the tool functionalities, the users found the interactive mechanism an efficient support for video annotation. The best rated functionalities of the tool are the keyboard short-cuts and the graphical user interface.

As future work we plan to extend the annotation tool to include the observations that emerged from the users' interviews. We also plan to customize and include a larger number of computer vision algorithms, specifically adapted to work in an interactive framework.

ACKNOWLEDGMENTS

The R&D project “Feed for Good” is coordinated by Almaviva and partially supported by Regione Lombardia (www.regione.lombardia.it). The University of Milano-Bicocca is supported by Almaviva. The authors thank Almaviva for the permission to present this paper.

REFERENCES

[1] Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., and Jain, R., "Content-based image retrieval at the end of the early years," IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1349–1380 (2000).

[2] Datta, R., Joshi, D., Li, J., and Wang, J. Z., "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys 40(2) (2008).

[3] Hu, W., Xie, N., Li, L., Zeng, X., and Maybank, S., "A survey on visual content-based video indexing and retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 41(6), 797–819 (2011).

[4] "The PASCAL Visual Object Classes challenge." http://pascallin.ecs.soton.ac.uk/challenges/VOC/.

[5] "VIRAT video dataset." http://www.viratdata.org.

[6] "PETS: Performance evaluation of tracking and surveillance." www.cvg.cs.rdg.ac.uk/slides/pets.html.

[7] "TRECVID: TREC video retrieval evaluation." http://trecvid.nist.gov.

[8] Hu, W., Tan, T., Wang, L., and Maybank, S., "A survey on visual surveillance of object motion and behaviors," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(3), 334–352 (2004).

[9] Wang, X., "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters 34(1), 3–19 (2013).

[10] Pal, N. R. and Pal, S. K., "A review on image segmentation techniques," Pattern Recognition 26(9), 1277–1294 (1993).

[11] Yilmaz, A., Javed, O., and Shah, M., "Object tracking: A survey," ACM Comput. Surv. 38(4) (2006).

[12] Zhang, C. and Zhang, Z., "A survey of recent advances in face detection," Technical report, Microsoft Research (2010).

[13] Torralba, A., Russell, B., and Yuen, J., "LabelMe: Online image annotation and applications," Proceedings of the IEEE 98(8), 1467–1484 (2010).

[14] Yuen, J., Russell, B., Liu, C., and Torralba, A., "LabelMe video: Building a video database with human annotations," in [Computer Vision, 2009 IEEE 12th International Conference on], 1451–1458 (2009).

[15] Vondrick, C., Patterson, D., and Ramanan, D., "Efficiently scaling up crowdsourced video annotation," International Journal of Computer Vision, 1–21.

[16] Mihalcik, D. and Doermann, D., "The design and implementation of ViPER," (2003).

[17] Ali, K., Hasler, D., and Fleuret, F., "FlowBoost: Appearance learning from sparsely annotated video," in [Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on], 1433–1440 (June 2011).

[18] Kavasidis, I., Palazzo, S., Di Salvo, R., Giordano, D., and Spampinato, C., "A semi-automatic tool for detection and tracking ground truth generation in videos," in [Proceedings of the 1st International Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications], VIGTA '12, 6:1–6:5, ACM, New York, NY, USA (2012).

[19] Yao, A., Gall, J., Leistner, C., and Van Gool, L., "Interactive object detection," in [Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on], 3242–3249 (June 2012).

[20] Ciocca, G. and Schettini, R., "An innovative algorithm for key frame extraction in video summarization," Journal of Real-Time Image Processing 1, 69–88 (2006).

[21] "Qt framework." http://qt-project.org.

[22] "Open Computer Vision library - OpenCV." http://opencv.org.

[23] Brooke, J., "SUS: A Quick and Dirty Usability Scale," in [Usability Evaluation in Industry], Jordan, P. W., Thomas, B., Weerdmeester, B. A., and McClelland, I. L., eds., Taylor & Francis, London (1996).

