
A METHOD AND BROWSER FOR CROSS-REFERENCED VIDEO SUMMARIES

Aya Aner, Lijun Tang, and John R. Kender

Department of Computer Science, Columbia University, New York, NY 10027

E-mail: {aya, ljtang, jrk}@cs.columbia.edu

ABSTRACT

We present an automatic tool for compact representation and cross-referencing of long video sequences, based on a novel visual abstraction of semantic content. Our highly compact hierarchical representation results from the non-temporal clustering of scene segments into a new conceptual form grounded in the recognition of real-world backgrounds. We represent shots and scenes using mosaics and employ a novel method for comparing scenes based on these representative mosaics. We then cluster scenes together into a higher level of abstraction: the physical setting. We demonstrate our work using situation comedies (sitcoms), where each half-hour episode is well structured by rules governing background use. Consequently, browsing, indexing, and comparison across videos by physical setting is very fast. Further, we show that physical settings lead to a higher-level contextual identification of the main plots in each video. We demonstrate these contributions with a browsing tool whose top-level single page displays the settings of several episodes. This page expands to display windows for each episode, and each episode menu summary is further expanded into scenes and shots, all by mouse-clicking on appropriate plots and settings according to user interests.

1. INTRODUCTION

Several tools have been suggested in the past for summarizing video data. Many systems [1] used a shot-based representation, which is not suitable for long sequences. A shot is a sequence of consecutive frames taken from the same camera, and a half-hour video, for example, could contain hundreds of shots. Since in many video genres shots often repeat (dialog shots, for example), using information from each shot is redundant. This is solved for many structured video genres by further segmenting shots into scenes [2]. A scene is a sequence of consecutive shots that share some common property, usually the same physical location. A more compact representation of a scene would use only shots that are visually different [3][4]; for example, a single shot would be chosen for every unique camera location.

Most existing video summary tools rely on key-frames for visual information and display. However, this representation is lacking, since a single frame cannot encompass the full information about a shot or a scene: either the background is not fully visible or not all the characters appear in just one frame. A solution to this problem is the use of mosaics. A mosaic represents the full background seen in a shot and can either eliminate foreground objects [5] or display their trajectories [6]. However, when generating summaries of long videos for quick previews and fast comparisons, foreground information is less relevant and harder to represent; it appears instead to be more relevant for describing scenes within a single video, or for classifying short video segments in action movies or sports broadcasts according to motion (that is, "intra-video" indexing). For the higher data demands of full video compression and searching across a corpus of full videos (that is, "inter-video" indexing), we therefore use "clear" mosaics, where foreground information is removed when possible.

Having chosen the clear background scene mosaic as the fundamental visual unit for abstracting high-level non-temporal structure, we next determine which shots should be used to represent a scene or a physical setting. Their corresponding mosaics are then used to cluster these scenes into physical settings and to compare physical settings across episodes (videos). We can then distinguish between common physical settings, which tend to repeat in episodes belonging to the same sitcom, and non-recurring physical settings. There are usually only a few non-recurring physical settings, and by the rules of screen-writing, they most often represent the main plots in each episode.

The advantage of using mosaics to represent shots is not confined to well-structured video genres such as sitcoms and other TV programs (e.g., news, drama series). It is also useful in sports videos, where it is easier to cluster shots using a mosaic-based rather than a key-frame representation [7]. However, in this paper we present only our work on a browser which exploits the structure of situation comedies.


2. CHOOSING REPRESENTATIVE MOSAICS

Using previously proposed methods based on a model of human visual attention, the video is first segmented into shots [8], and shots are then segmented into scenes [2]. This results in a temporal representation for each episode, with a modest number of scenes each. Since we use a mosaic-based representation, and since each mosaic is generated from a single shot, we need to choose the minimal number of shots that derive the minimal number of mosaics which fully represent the backgrounds of their scenes.

2.1. Representing Scenes

We determined three types of shots which are good candidates for representing the background information of a scene. As a heuristic, we always choose the first shot of a scene, since it is often the "establishing shot" [9], a wide-angle shot photographed for the purpose of identifying the location and characters of that scene. Additionally, we heuristically also include any pan and zoom shots of a scene. In pan shots, large portions of the background are visible throughout the shot, and in zoom shots, the zoomed-out portion also exposes large parts of the physical setting.
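The selection heuristic above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the shot records and their `motion` labels (assumed to come from an upstream shot-motion classifier) are hypothetical data structures.

```python
def select_representative_shots(scene_shots):
    """Pick candidate background shots for one scene.

    scene_shots: temporally ordered list of dicts like
    {"id": 3, "motion": "static"}, where "motion" is one of
    "static", "pan", or "zoom" (labels assumed, for illustration).
    """
    # Always keep the first shot: it is often the establishing shot.
    selected = [scene_shots[0]]
    for shot in scene_shots[1:]:
        # Pan and zoom shots expose large portions of the background.
        if shot["motion"] in ("pan", "zoom"):
            selected.append(shot)
    return selected

scene = [{"id": 0, "motion": "static"},
         {"id": 1, "motion": "pan"},
         {"id": 2, "motion": "static"},
         {"id": 3, "motion": "zoom"}]
print([s["id"] for s in select_representative_shots(scene)])  # [0, 1, 3]
```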

The process of generating color mosaics is described in [7] and is based on [10]. Our current method for mosaic comparison is detailed in [7] and is used to compare scenes and cluster them into physical settings. It is based on rubber-sheet matching, which takes into account the topological distortions among the mosaics and the rubber-sheet transformations between two mosaics of the same physical scene. The comparison process is done in a coarse-to-fine manner. Since mosaics of common physical scenes cover different parts of the scene, we first detect areas in every mosaic pair which correspond to the same spatial area. A finer stage verifies corresponding regions and generates final match values between scenes. A clustering algorithm is then applied to the scene-to-scene match values and detects clusters of scenes, which are found to correspond to unique physical settings.
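The final clustering step consumes only the scene-to-scene match values. The authors' actual clustering algorithm is in the cited work [7]; as a stand-in, the sketch below groups scenes whose pairwise match score exceeds a threshold using a simple union-find, with an invented score matrix and threshold.

```python
def cluster_scenes(match, threshold=0.7):
    """Group scene indices into clusters (candidate physical settings).

    match: symmetric matrix of scene-to-scene match values in [0, 1].
    threshold: illustrative cutoff; the real criterion is in [7].
    """
    n = len(match)
    parent = list(range(n))

    def find(i):
        # Find the cluster root, with path halving for efficiency.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair of scenes that match well enough.
    for i in range(n):
        for j in range(i + 1, n):
            if match[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

match = [[1.0, 0.9, 0.1],
         [0.9, 1.0, 0.2],
         [0.1, 0.2, 1.0]]
print(cluster_scenes(match))  # [[0, 1], [2]]
```

Transitive grouping is deliberate here: two mosaics of one setting may not overlap directly but can be linked through a third that overlaps both.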

2.2. Representing Physical Settings

Although the scene clustering process typically results in a handful of physical settings, there are several scenes in each physical-setting cluster, and a few mosaics representing each scene. Ideally, for the purposes of display and user interface, we would like to choose a single mosaic to represent each physical setting. However, this is not always possible. Shots of scenes in the same physical setting, and sometimes even within the same scene, are filmed using cameras in various locations which show different parts of the background. Therefore, two mosaics of the same physical setting might not have any corresponding regions at all.

Instead, we use the results of the matching algorithm's finer stage, which recognizes corresponding regions in the mosaics, to determine a "minimal covering set" of mosaics for each physical setting. We approximate this set (since the exact solution is an NP-hard problem) by clustering all the representative mosaics of one physical setting and choosing a single mosaic to represent each cluster. This single mosaic is the centroid of the cluster, i.e., the mosaic which has the best average match value to the rest of the mosaics in that cluster. An example for one episode is shown in Figure 1; the episode's physical settings and scenes, and the relations between them, are shown by the colored lines. For display purposes, we choose only one mosaic from the largest cluster to represent each setting.
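The centroid rule stated above (the member with the best average match to the rest of its cluster) can be sketched directly; the match matrix below is illustrative, not real data.

```python
def centroid_mosaic(members, match):
    """Return the cluster member with the best average match to the rest.

    members: indices of the mosaics in one cluster.
    match: symmetric matrix of mosaic-to-mosaic match values.
    """
    if len(members) == 1:
        return members[0]  # a singleton cluster represents itself

    def avg_match(i):
        others = [match[i][j] for j in members if j != i]
        return sum(others) / len(others)

    return max(members, key=avg_match)

# Illustrative match values: mosaic 1 overlaps both others well,
# so it is the best single representative of the cluster.
match = [[1.0, 0.8, 0.6],
         [0.8, 1.0, 0.7],
         [0.6, 0.7, 1.0]]
print(centroid_mosaic([0, 1, 2], match))  # 1
```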

Figure 1: Example summary of one episode. Its semantic representation is based on physical settings which are directly related to the temporal representation of its scenes.

The final step is the cross-referencing of episodes from the same sitcom. Physical settings across episodes are compared and clustered, and common settings are determined. It is often the case that each episode has a few common settings, which are identified with the general theme of the sitcom, and a few more settings which are often unique to the plots of that episode. Finally, the descriptive textual labelling of physical settings is done manually. Since there are usually only a few common settings for each sitcom (due to economic constraints on set design and to sitcom writing rules), only a few additional settings are introduced with every newly added episode.

3. HIERARCHICAL BROWSING TOOL

We have constructed a browsing tool that indexes a library of videos by both compact semantic representations and temporal representations. The compact visual summary enables cross-referencing of different episodes and fast main-plot analysis. The temporal display is used for fast browsing of each episode.

The main menu is displayed as a table-like summary in a single window. Each row in the table represents one episode of the specified sitcom. The columns represent the different physical settings that were determined during the scene-clustering phase for all episodes. Each cell (i, j) in the table is either empty (setting j does not appear in episode i) or displays a representative mosaic for setting j, taken from episode i. The order of the columns from left to right is organized from the most common to the non-common settings. In our example, the first three columns represent common settings which repeat in almost every episode of the specific sitcom. The remaining columns are unique to each episode. In this manner, the user can immediately recognize the main plots of each episode by looking for non-empty cells in the row of that episode, starting from the fourth column. For example, for the episode marked "Friends 2" in the top row of Figure 2(a), the main plots involve scenes taking place in settings "Bedroom1" and "Bedroom2". To confirm the main plots quickly, it is sufficient to left-click on the representative mosaics for these settings (Figure 3(b)), which displays a window with a short list of scene mosaics that correspond to those settings (usually one or two); if further needed, double-clicking on the representative mosaic for each scene starts playing the video from the beginning of that scene (Figure 3(d)).

(a)

(b)

Figure 2: Temporal vs. non-temporal representation of episodes: (a) A single window showing all episodes and physical settings, displayed as a table where each episode has entries (mosaic images) only in the relevant settings. This is a very compact non-temporal representation of a video. (b) By left-clicking on the middle episode "Friends3", it is expanded to show its temporal representation, scene by scene; each scene is represented by a single mosaic.
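The common-to-unique column ordering can be derived from how many episodes contain each setting. The sketch below is an assumption about one reasonable way to compute it; the episode and setting names are invented for illustration.

```python
def order_settings(episodes):
    """Order setting columns from most common to unique.

    episodes: dict mapping episode name -> set of setting names
    that appear in that episode.
    """
    counts = {}
    for settings in episodes.values():
        for s in settings:
            counts[s] = counts.get(s, 0) + 1
    # Most frequent settings first; setting name as a stable tie-breaker.
    return sorted(counts, key=lambda s: (-counts[s], s))

episodes = {
    "Friends1": {"Apartment1", "Cafe", "Bedroom1"},
    "Friends2": {"Apartment1", "Cafe", "Bedroom2"},
    "Friends3": {"Apartment1", "Cafe", "Office"},
}
print(order_settings(episodes))
# ['Apartment1', 'Cafe', 'Bedroom1', 'Bedroom2', 'Office']
```

Shared settings ("Apartment1", "Cafe") land in the leftmost columns, so a reader scanning any row from the right immediately sees that episode's unique, plot-bearing settings.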

The temporal representation of each episode is also accessed from the main menu and is used for fast browsing of that episode. By left-clicking on an episode name, as shown in Figure 2(b), a window listing all scene mosaics belonging to that episode appears. Each scene on that list is represented by a single mosaic, and it can optionally be expanded, by left-clicking, into a window listing the representative mosaics (shots) for that scene (Figure 3(c)). Fast browsing is performed by scanning the scenes in order and playing only relevant video segments from chosen scenes by double-clicking on them, as shown in Figure 3(d).

Our browser has the advantage of being both hierarchical, displaying semantically oriented visual summaries of videos in a non-temporal, tree-like fashion, and of semantically relating different episodes of the same sitcom to each other. We have tested its usefulness by analyzing feedback from several subjects who follow the sitcom. We tested whether the temporal scene representation allowed meaningful fast browsing, whether subjects were able to recognize the main plots of each episode using our menus, and how fast they performed. The feedback was encouraging: main plots of familiar episodes were recognized within a few minutes, which included first browsing the scenes for a general impression of temporal flow, and then clicking on non-common settings and viewing the video. The most interesting result was that almost all of the main plots (all except one) were recognized without playing the corresponding audio; only the video was played. This demonstrates the highly semantic value of the physical setting in summarizing and then recalling a video.

4. DISCUSSION AND CONCLUSION

We presented a compact browsing tool which allows fast comparison of relatively long video sequences, and demonstrated it using episodes from a sitcom. Our approach of compact video summaries exposes the videos' semantic structure. Many TV series have similar semantic structures, and we believe that our approach could be further extended to browsing feature movies. It is also in accordance with the manner in which news clips are segmented into anchor shots and story segments [11]. Although not presented here, mosaic-based representation and comparison have also been demonstrated to be useful in summarizing and detecting interesting events in sports sequences [7], such as basket goals.


(a)

(b)

(c)

(d)

Figure 3: Example menus of the browsing tool. (a) Right-click on a single mosaic (first row, second from left) to enlarge it. (b) Left-click on a single mosaic (leftmost mosaic in the first row, the representative mosaic for setting "Apartment1" of episode "Friends3") to expand that physical setting, showing all scenes of that setting. (c) By left-clicking on the second mosaic from the left in the menu shown in (b), scene #3 is chosen and its representative mosaics are enlarged and displayed. (d) By double-clicking on the third mosaic from the left in the menu shown in (b), the movie clip of scene #7, starting from the first shot of that scene, is displayed.

5. REFERENCES

[1] P. Aigrain, H. J. Zhang, et al., "Content-based representation and retrieval of visual media: A state-of-the-art review," in Multimedia Tools and Applications, Kluwer Academic Publishers, 1996.

[2] J. R. Kender and B. L. Yeo, "Video scene segmentation via continuous video coherence," in Proceedings of IEEE CVPR, 1998.

[3] M. Yeung and B. L. Yeo, "Time-constrained clustering for segmentation of video into story units," in ICPR, 1996.

[4] S. Uchihashi, J. Foote, A. Girgensohn, and J. Boreczky, "Video Manga: Generating semantically meaningful video summaries," in Proceedings of ACM Multimedia, 1999.

[5] M. Irani and P. Anandan, "Video indexing based on mosaic representations," Proceedings of the IEEE, vol. 86, 1998.

[6] M. Gelgon and P. Bouthemy, "Determining a structured spatio-temporal representation of video content for efficient visualisation and indexing," in Proceedings of ECCV, 1998.

[7] A. Aner and J. R. Kender, "Video summaries through mosaic-based shot and scene clustering," in Proceedings of ECCV, 2002.

[8] A. Aner and J. R. Kender, "A unified memory-based approach to cut, dissolve, key frame and scene analysis," in Proceedings of IEEE ICIP, 2001.

[9] D. Arijon, Grammar of the Film Language, Silman-James Press, 1976.

[10] M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, "Efficient representation of video sequences and their applications," Signal Processing: Image Communication, vol. 8, 1996.

[11] S.-F. Chang, Q. Huang, T. Huang, S. Puri, and B. Shahraray, "Multimedia search and retrieval," in Advances in Multimedia: Systems, Standards and Networks, Marcel Dekker, New York, 1999.

