A Framework for Review, Annotation, and Classification of Continuous Video in Context

Tobias Lensing, Lutz Dickmann, and Stéphane Beauregard

University of Bremen, Bibliothekstrasse 1, 28359 Bremen, Germany
{tlensing,dickmann,stephane.beauregard}@tzi.de

http://www.informatik.uni-bremen.de/

Abstract. We present a multi-modal video analysis framework for life-logging research. Domain-specific approaches and alternative software solutions are referenced, then we briefly outline the concept and realization of our OS X-based software for experimental research on segmentation of continuous video using sensor context. The framework facilitates visual inspection, basic data annotation, and the development of sensor fusion-based machine learning algorithms.

Key words: Artificial Intelligence, Information Visualization, Human Computer Interaction, Visual Analytics

1 Introduction

The Memex vision describes reviewing memories with an interactive workstation; related illustrations show a head-mounted camera [1]. Civil [2, 3] and military research [4] alike have since advanced life-logging, a research domain focused on the aggregation of digital personal memories and on their context-based indexing. A motivation for the work reported here is to investigate how continuous digital video (in contrast to still images) from wearables can be segmented into meaningful chunks by exploiting contextual information from complementary measurement streams—provided, e.g., by audio microphones, GPS receivers, infra-red (IR) or MEMS inertial and pressure sensors.

We outline our implementation of a multi-modal video analysis framework for visual inspection, annotation, and experimental development plus evaluation of pattern recognition algorithms. Our focus here is a fundamental task commonly not addressed in problem-specific research papers due to scope restrictions, namely the systematic preparation of an infrastructure that facilitates practical experiments in a data-intensive domain like life-logging.

A brief state-of-the-art discussion follows that characterizes multi-modal analysis of continuous video in the life-logging domain and references a similar framework suitable for the respective set of tasks. We outline our own concept and implementation in Sect. 2, then discuss our work in Sect. 3.

State of the Art. The SenseCam [5] project is a prominent reference for research on sensor-equipped wearables to record daily life memories. A prototype records low-resolution still images (in the order of roughly 1,900 per day and user) in regular intervals and upon events inferred from sensor data. It logs data from an accelerometer and an IR sensor. Structuring still image collections here means to find representative keyframes, form groups, and index data by context categories [6, 7]. Visual life-logs compiled with this device imply a relatively low data rate in comparison to digital video, so still image-centric frameworks are not per se useful when we aggregate more than 15 frames per second. Classical video summarization techniques [8, 9] again may resort to cut detection in movies for automatic segmentation—which does not hold for continuous video—and typically disregard additional information streams. Kubat et al. (2007) introduce a software tool named TotalRecall¹ that facilitates semi-automatic annotation of large audio-visual corpora [10]. The system aligns recordings from stationary cameras and microphones for semi-automatic labeling. Our focus on experimental algorithm design for more diverse multi-modal logs from wearables may be accommodated with slight adaptations; the functionality is broadly in line with our requirements and goals. However, the tool is reserved for exclusive research as it has not been released to the public.

2 Concept and Implementation

We assess that while a coherent framework with the desired characteristics is not publicly available, most necessary components can be obtained from the open source/public license communities (but different encodings and distinct workflows need to be interfaced for practical applications). Our working hypothesis thus is that we can build a research toolkit with the problem-specific functionality using open software packages that run on a suitable common platform, achieving consistent integration by adequate framework design and data management facilities.

Our development platform of choice is OS X since (a) OS X is fully POSIX-compliant and is among the prime target platforms of open source scientific computing projects; (b) the entire system UI is OpenGL-based and hardware-accelerated; (c) native support for high performance linear algebra operations is given; (d) a complementing focus on interface design issues should acknowledge the prevalent role of OS X platforms in design-related disciplines. The interdisciplinary origin of our software development efforts is elaborated on in another publication [12]. Our Cocoa-based implementation adheres to the model/view/controller paradigm and is written in Objective-C/C++, which allows seamless integration of libraries in the ANSI C or C++ dialects, in our case most notably OpenCV (for feature extraction and analysis), FFmpeg (for video decoding), and the C Clustering Library (for efficient clustering). Python bindings allow quick experiments with a wide range of algorithms from the Python-friendly scientific computing community. An example application shown in Fig. 1 demonstrates the versatility of the Cocoa-based approach: here we plot spatial coordinates via the Google Maps API².
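As a minimal sketch of this integration (illustrative only; function and parameter names are hypothetical and not taken from our codebase), an Extractor-style helper computing a per-frame hue/saturation histogram with OpenCV's C++ API could look as follows. Because Objective-C++ compiles C++ directly, a Cocoa controller can call such a function without an additional bridging layer.

#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Computes a normalized hue/saturation histogram for one decoded BGR frame.
std::vector<float> hsHistogram(const cv::Mat& bgrFrame, int hueBins = 30, int satBins = 32)
{
    cv::Mat hsv;
    cv::cvtColor(bgrFrame, hsv, CV_BGR2HSV);           // color space conversion

    const int histSize[] = { hueBins, satBins };
    const int channels[] = { 0, 1 };                    // hue and saturation planes
    float hueRange[] = { 0, 180 }, satRange[] = { 0, 256 };
    const float* ranges[] = { hueRange, satRange };

    cv::MatND hist;
    cv::calcHist(&hsv, 1, channels, cv::Mat(), hist, 2, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);   // bins sum to one

    std::vector<float> feature;                         // flat vector, ready for caching
    feature.reserve(hueBins * satBins);
    for (int h = 0; h < hueBins; ++h)
        for (int s = 0; s < satBins; ++s)
            feature.push_back(hist.at<float>(h, s));
    return feature;
}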

¹ Note the (unintended) analogy with our project name AutoMemento, which was defined independently.


Fig. 1. An already implemented application based on our framework (labeled screenshot): (a) a cluster inspector, (b) a 3-D collage of detected ‘faces’, (c) a frame inspector, (d) a map inspector, (e) ‘liquid’ videogram [11] with position indicator, (f) several feature timelines with intermediate clustering results, with experimental metrics being tested on HSV, motion vector, and face occurrence histograms as well as location; also displaying a meta timeline with merged segments

A Decoder reads input data; in the case of video it also accesses raw compressed-domain digital video features (e.g., motion vector fields) and performs color space conversion. The Indexer of our framework enables disk-cached random access and encapsulates further feature extraction via a variety of custom Extractor classes used to calculate distinct problem-specific types of features from the available information streams. A Clusterer partitions the time series with exchangeable similarity metrics and constraint sets. It updates timeline visualizations during processing for live progress indication. Timelines and multi-purpose inspection windows are parts of our OpenGL-based UI system that uses a custom scene graph. The Clusterer is yet to be complemented by a FlowAnalyzer for time series analysis models like HMMs or echo state networks (ESNs) [13] for dynamical pattern recognition.
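To illustrate the kind of partitioning the Clusterer performs (a deliberately simplified sketch with a hypothetical threshold rule, not our actual constraint-based clustering), adjacent frames can be grouped into temporally contiguous segments whenever consecutive feature vectors stay close under an exchangeable metric such as an L1 histogram distance.

#include <cmath>
#include <cstddef>
#include <vector>

// Exchangeable similarity metric; here a plain L1 distance between histograms.
double l1Distance(const std::vector<float>& a, const std::vector<float>& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        d += std::fabs(a[i] - b[i]);
    return d;
}

// Returns the index of the first frame of each segment (segments stay contiguous in time).
std::vector<std::size_t> segmentByThreshold(const std::vector<std::vector<float> >& frames,
                                            double threshold)
{
    std::vector<std::size_t> boundaries;
    if (frames.empty())
        return boundaries;
    boundaries.push_back(0);                             // first segment starts at frame 0
    for (std::size_t i = 1; i < frames.size(); ++i)
        if (l1Distance(frames[i - 1], frames[i]) > threshold)
            boundaries.push_back(i);                     // open a new segment at frame i
    return boundaries;
}

Exchanging the metric or the threshold rule yields different intermediate segmentations of the kind shown in the feature timelines of Fig. 1.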

3 Discussion and Conclusion

The proposed system suits a pattern recognition-centric research domain like life-logging in that it facilitates fundamental tasks like training set annotation, feature extraction, and review of raw data with analysis results. Its architecture and bindings accommodate experimental algorithm design for segmentation and classification of multi-modal streams.

² The map hack assumes an Internet connection, bridges with JavaScript within a WebKit view, and is for demo purposes only. NB: The partially occluded logotype in the research prototype screenshot should read Google Inc.


Time series of data from various sources (GPS, MEMS devices, IR, audio, but also weather data or incoming personal messages) can be incorporated in the extensible system. The (frame-wise) disk cache-based data management approach allows rapid data access within windows limited primarily by disk space (we experiment with intervals of approximately 30 minutes of continuous video). This places constraints on the time scales that can be considered, so super-sampling schemes have yet to be realized for long-term recordings. Even without that, our software engineering approach already represents a successful enabling step for a range of experiments in multi-modal video analysis. Our current work on dynamical machine learning techniques motivates ESN [13] integration into the framework. Beyond present OpenCV functionality, in future research we also seek to add as feature streams the results of advanced computer vision techniques. These may exploit GPU hardware acceleration for, e.g., fine-grained motion vector estimation, foreground vs. background image segmentation, structure from motion, or ego-motion estimation.
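A minimal sketch of such frame-wise disk caching (the file layout and function names below are hypothetical; our Indexer is more elaborate) stores each frame's feature vector in its own binary file so that any frame within the current window can be re-read directly.

#include <cstdio>
#include <string>
#include <vector>

// Builds the path of the per-frame cache file, e.g. <dir>/frame_00001234.feat.
std::string cachePath(const std::string& dir, unsigned long frameIndex)
{
    char name[64];
    std::sprintf(name, "/frame_%08lu.feat", frameIndex);
    return dir + name;
}

bool writeFeature(const std::string& dir, unsigned long frameIndex,
                  const std::vector<float>& feature)
{
    std::FILE* f = std::fopen(cachePath(dir, frameIndex).c_str(), "wb");
    if (!f) return false;
    std::size_t written = std::fwrite(&feature[0], sizeof(float), feature.size(), f);
    std::fclose(f);
    return written == feature.size();
}

bool readFeature(const std::string& dir, unsigned long frameIndex,
                 std::vector<float>& feature)   // caller sizes the vector to the feature length
{
    std::FILE* f = std::fopen(cachePath(dir, frameIndex).c_str(), "rb");
    if (!f) return false;
    std::size_t read = std::fread(&feature[0], sizeof(float), feature.size(), f);
    std::fclose(f);
    return read == feature.size();
}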

References

1. Bush, V.: As we may think. The Atlantic Monthly 176(1) (July 1945). NB: Illustrations with the mentioned head-mounted camera appear in Life Magazine 11, 1945.

2. Mann, S.: Wearable computing: A first step toward personal imaging. IEEE Computer 30(2) (1997) 25–32

3. Billinghurst, M., Starner, T.: Wearable devices: New ways to manage information. Computer 32(1) (1999) 57–64

4. Maeda, M.: DARPA ASSIST (2005). Electronic document, published Feb 28, 2005. Retrieved Mar 18, 2009 from http://assist.mitre.org/.

5. Microsoft Research Sensors and Devices Group: Microsoft Research SenseCam (2009). Electronic document. Retrieved Mar 18, 2009 from http://research.microsoft.com/sendev/projects/sensecam/.

6. Smeaton, A.F.: Content vs. context for multimedia semantics: The case of SenseCam image structuring. In: Proceedings of the First International Conference on Semantics and Digital Media Technology. (2006) 1–10

7. Doherty, A., Byrne, D., Smeaton, A.F., Jones, G., Hughes, M.: Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. In: Proc. Intl. Conference on Image and Video Retrieval. (2008)

8. Lienhart, R., Pfeiffer, S., Effelsberg, W.: Video abstracting. Communications of the ACM 40(12) (December 1997) 54–62

9. Ferman, A.M., Tekalp, A.M.: Efficient filtering and clustering methods for temporal video segmentation and visual summarization. Journal of Visual Communication and Image Representation 9(4) (December 1998) 336–351

10. Kubat, R., DeCamp, P., Roy, B.: TotalRecall: Visualization and semi-automatic annotation of very large audio-visual corpora. In: ICMI '07: Proceedings of the 9th International Conference on Multimodal Interfaces. (2007) 208–215

11. MacNeil, R.: Generating multimedia presentations automatically using TYRO, the constraint, case-based designer's apprentice. In: Proc. VL. (1991) 74–79

12. Dickmann, L., Fernan, M.J., Kanakis, A., Kessler, A., Sulak, O., von Maydell, P., Beauregard, S.: Context-aware classification of continuous video from wearables. In: Proc. Conference on Designing for User Experience (DUX07). (2007)

13. Jaeger, H.: Discovering multiscale dynamical features with hierarchical echo state networks. Technical Report 10, Jacobs University/SES (2007)

