Chiang Mai J. Sci. 2015; 42(4) : 1005-1018
http://epg.science.cmu.ac.th/ejournal/
Contributed Paper

Sequence Matching Based Automatic Retake Detection Framework for Rushes Video

Narongsak Putpuek [a], Nagul Cooharojananone*[a], Chidchanok Lursinsap [a] and Shin'ichi Satoh [b]

[a] Advanced Virtual and Intelligent Computing Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand.
[b] National Institute of Informatics (NII), Tokyo, 101-8430 Japan.
*Author for correspondence; e-mail: [email protected]

Received: 11 June 2013
Accepted: 11 June 2014

ABSTRACT

Automatically selecting the important content from rushes video is a challenging task due to the difficulty of eliminating raw data, such as useless content and redundant content. Redundancy elimination is difficult since repetitive segments, which are takes of the same scene, usually have different lengths and motion patterns. In this work, a new methodology is proposed to determine retakes in rushes video. The video is divided into shots by the proposed automatic shot boundary detection using local singular value decomposition and k-means clustering. Shots that contain useless content were eliminated by our proposed technique integrated with a near-duplicate key-frame (NDK) algorithm. The local features of each remaining frame were extracted using the scale-invariant feature transform (SIFT) algorithm. The similarity between consecutive frames was calculated using SIFT matching and then converted into a string. The given strings were then concatenated into a longer string sequence to use as a shot representation. The similarity between each pair of sequences was evaluated using the longest common subsequence algorithm. Our method was evaluated in direct comparison with the conventional technique. Overall, when evaluated across the TRECVID 2007 and 2008 data sets that represent diverse styles of rushes videos, the proposed methodology provided a higher degree of accuracy in the detection of retakes in rushes videos.

Keywords: rushes video, sequence matching, retake detection, SIFT, LCS, SVD

1. INTRODUCTION

Advances in digital technology, such as data acquisition, storage and communication technologies, have progressed significantly, resulting in an increase in the availability of video data at an exponential rate. In post-production management of digital cinema content, a large amount of the raw video (rushes video) needs to be viewed and organized with respect to the importance of the content required in the final version. In general, rushes videos are the raw videos recorded from a film making production. The basic structure of a rushes video is that the same scene is recorded several times


according to the script at the director's commands, and it includes various different settings and unexpected mistakes between each take [1]. Therefore, there are two types of content in the rushes video: useless content and redundant content. The useless content is the shots that are irrelevant to the main content of the video, such as color bars, single color frames and clapper boards. The redundant content (usually called a retake) is a set of repetitive shots with the same or near-identical setting. Figure 1 shows an example of a retake which consists of two takes from video MRS151585. Although it is necessary to detect retakes and select the best take as the important content, automatically selecting the important content from the rushes is a challenging task due to the difficulty of eliminating raw data, such as color bars, single color frames, clapper boards and redundant content. Redundancy elimination is difficult since repetitive segments, which are takes of the same scene, usually have different lengths and motion patterns.

Figure 1. Example of a retake (from video MRS151585), showing two takes that have a near-identical setting, and a common subsequence of specific object patterns.

Several approaches have been proposed for the detection of redundant content in videos. Linjin [2] focused on detecting and removing the redundant frames using hierarchical agglomerative clustering (HAC) and the Smith-Waterman algorithm to produce the video summary. For matching commercial film clips, binary signatures and a simple distance matching method have been evaluated [3]. However, these conventional detection methods for redundant frames do not provide satisfactory levels of efficiency for rushes video, due to the significant differences between videos. Moreover, these videos are unedited and contain useless as well as redundant content.

Recently, several methods to detect retakes in rushes videos have been proposed [4-8]. In the first step of automatic video processing systems, the continuous video sequences are usually segmented into shots as the basic video units. Automatic shot boundary detection (SBD) was then applied using pixel differences within a given threshold [8], histogram differences with a given threshold [4] or an adaptive threshold [5], and color-texture with the analysis of temporal slices [6]. Key-frame selection methods were then employed to extract the representative shots or sub-shots [4-8]. In order to detect retakes in rushes video, the HAC algorithm was used based on sub-shots in the form of their average on a local histogram [4]. Liu et al. [7] proposed a multi-state clustering algorithm based on key-frames of the sub-shots, whilst detection of retakes using key-frame and speech transcript comparisons based on a directed graph has also been reported [6].


However, these methods are limited because the shot or sub-shot representative is dependent on the key-frame selection methods. Moreover, shots with a low number of key-frames cannot provide a high enough resolution (efficiency) for the clustering method.

Another approach that has been proposed to detect retakes is based on sequence matching [10]. A distance measure based on the longest common subsequence (LCS) algorithm was presented, in which the rushes video was decomposed into segments using SBD based on a support vector machine (SVM) classifier. The similarities for all shots were computed using the LCS algorithm, and two shots were then merged if their similarity was larger than a predefined threshold, using single-linkage clustering. Although this method has a good performance for detecting retakes in rushes videos, it cannot differentiate between scenes with only a little action and the same background, such as in the same room, because the difference in visual information and activity between such scenes is very small.

Figure 2. Proposed framework.

In this work, a method to detect retakes in rushes video using (1) the characteristics of the video sequence, (2) the object recognition method based on the scale-invariant feature transform (SIFT) algorithm, and (3) the retake detection and LCS algorithms was proposed to handle the limitations of the previous work. This method exploits the fact that the video sequence can be represented as a sequence of objects. In rushes video, a retake is essentially a set of repetitive shots with the same or near-identical setting, and typically has a common subsequence of specific object patterns (see Figure 1). To detect a common subsequence, the sequence of object locations was encoded into a sequence of string patterns


using the object recognition method based on the SIFT feature and the grid method. Then, the LCS algorithm was used to find a common subsequence, as it was recently reported to provide the best performance when used for matching sequences [11]. Utilizing the fact that each retake has two or more takes that appear as a sequence, retakes were detected by finding a common subsequence in the string pattern based on simple retake detection and LCS algorithms. The framework of this proposed method was designed as four steps (see Figure 2). In the first step, the rushes video was divided into sets of frames called shots using the proposed automatic SBD based on singular value decomposition (SVD) and k-means clustering. Then, in the second step, shots that contain a single color, color bars, or clapper boards were eliminated by a newly proposed algorithm and the near-duplicate key-frame (NDK) method. In the third step, the local features of each remaining frame in the shot were extracted using the SIFT algorithm. The similarity between consecutive frames was calculated using SIFT matching and converted into a string. All strings were then concatenated into a string sequence to use as a shot representative. Finally, in the fourth step, the similarity between two sequences was evaluated using the LCS algorithm, where a simple algorithm was performed to detect the retake using its defined characteristics.

2. SHOT BOUNDARY DETECTION (SBD)

In order to eliminate the redundant forms in a rushes video, it must first be organized into a compact form to extract semantically meaningful information. All rushes are unedited [1] and, therefore, they consist of hard cuts only. SBD is the most basic technique, widely used as the first step to organize the video data into segments. Although a fairly diverse array of approaches have been proposed for SBD, the simplest approach is to determine the difference between consecutive frames based on their global features, such as color histograms or pixel intensity [13]. Alternatively, another method used local features by dividing each input frame into blocks [4], and a threshold value was then used to classify the hard cuts. These threshold-based methods are fast and simple [13], but cannot differentiate between a hard cut and the motion of large objects. Moreover, it is difficult to achieve an equal efficiency and detection level across new videos. Accordingly, this problem can be overcome using a clustering method. Cernekova et al. [18] proposed a technique based on SVD and unsupervised clustering: SVD was applied on the color histogram, each candidate shot was determined using a static threshold between two consecutive frames, and the candidate shots were then merged using a hypothesis test between two consecutive shots. This method gave a good performance in terms of recall and precision, but was not suitable for rushes video because it merges the redundant content into one shot.

Recently, several studies have focused on the use of SVD for video sequence matching [12]. By considering the input image as a matrix A, it can be factored into singular values and the corresponding singular vectors. These values are useful information for discriminating image patterns or contents. However, a global feature is very sensitive to motion, such as that of a large object, and to a relatively high level of camera movement. A comparison of several SBD and classification techniques revealed that a local feature gave a higher performance and had the advantage of being insensitive to large object motion and high camera movement [13]. Therefore, the local SVD features were


used as the frame representative in this study.

Typically, the most widely used methods to detect a shot transition have been based on similarity measurements between two adjacent frames, where a threshold-based approach can be used to achieve this. The advantage of this type of method is that it is fast. However, a threshold-based approach cannot achieve an equal detection efficiency across different video data. Unsupervised clustering can be used to overcome this problem, and so in this study a simple k-means clustering approach to detect the shot transition was employed. Accordingly, in this work, we propose a new method for hard cut SBD based on SVD and k-means clustering.

Figure 3. An example of a frame that is divided into 4 × 4 blocks, with the layout of the local features.

2.1 Local Feature Extraction
In order to extract a local SVD feature, let frame f_t be an input video frame and let each frame f_t be divided into B × B blocks. Figure 3 shows an example of a frame that is divided into 4 × 4 blocks. Let A be the M × N matrix of the b-th block. The SVD of matrix A is then the factorization A = UWV^T, where U is an M × r column-orthogonal matrix, V is an N × r column-orthogonal matrix, and W = diag(σ_1, …, σ_r) is a diagonal matrix for r = min(M, N). These diagonal elements are called the singular values (SV). Thus, each SV vector can be used to describe each block b of frame f_t effectively. The SV vector for frame f_t is defined as W_b,t = [σ_2 … σ_r]^T, where W_b,t is the SV vector of block b for 1 ≤ b ≤ B², and the notation W_b,t[r] represents the value of the r-th element of the SV vector.
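As a concrete illustration of the per-block SVD feature described above, the singular values of each block can be computed as below. This is a minimal sketch under assumed conventions (grayscale frames, dimensions evenly divisible by B); the function name `local_svd_features` is ours, not the paper's.

```python
import numpy as np

def local_svd_features(frame: np.ndarray, B: int = 4) -> np.ndarray:
    """Divide a grayscale frame into B x B blocks and return the
    singular-value (SV) vector of each block, one row per block."""
    H, W = frame.shape
    bh, bw = H // B, W // B
    r = min(bh, bw)
    features = np.empty((B * B, r))
    for i in range(B):
        for j in range(B):
            block = frame[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            # Singular values of the block matrix A (from A = U W V^T),
            # returned by numpy in non-increasing order.
            features[i * B + j] = np.linalg.svd(block, compute_uv=False)
    return features
```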

2.2 Feature Similarity Measure
In order to detect a shot transition between two adjacent frames based on the local features, an appropriate similarity measurement is used. The Euclidean distance is probably the most commonly used similarity measurement for numerical data, where the similarity between the b-th blocks of frames f_t and f_{t+1} can be defined as in Eq. (1):

D_sim(b) = sqrt( Σ_{i=1}^{n} (W_b,t[i] − W_b,t+1[i])² )   (1)

where D_sim(b) is the similarity of the SV vectors between the b-th blocks of frames f_t and f_{t+1}.

2.3 Shot Clustering
We define the boundary types into two classes: (1) a normal boundary and (2) a cut boundary. A normal boundary is the boundary between two adjacent frames that have the same or nearly the same visual information, whereas a cut boundary is the boundary between two adjacent frames that have different visual information. In addition, the number of cut boundaries will be less than the number of normal boundaries for each video. Taking this definition into account, we can classify a given feature similarity between each two adjacent frames by using k-means clustering with k = 2.

In order to classify a shot boundary, the given D_sim(b) values were sorted into ascending order. Let b′ denote the region index after sorting, so that D_sim(b′) ≤ D_sim(b′+1). Then, for each two adjacent frames, these values were obtained by sorting their regions. With regard


to the problem caused by the motion of a large object and/or quick camera movement, this can be solved by removing the large values of D_sim(b′). Accordingly, the values of D_sim(b′) for clustering are combined as in Eq. (2):

x = Σ_{b′=1}^{B² − θ_avoid} D_sim(b′)   (2)

where θ_avoid is the number of the largest D_sim(b′) values that need to be removed. The input vector is defined as in Eq. (3):

X = [x_1 x_2 … x_d]^T   (3)

where d is the number of dimensions of X. Then, k-means clustering with k = 2 was applied to X to classify the shot boundary. Thus, the set of input vectors X was divided into two clusters, where the cut boundary is the cluster that has the lower number of vectors.
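The scoring of Eq. (2) and the two-class k-means step can be sketched as follows. This is an illustrative reimplementation, not the authors' code: `boundary_scores` and `classify_boundaries` are hypothetical names, and the k-means here is a simple one-dimensional variant applied to one score per adjacent-frame pair.

```python
import numpy as np

def boundary_scores(dists: np.ndarray, theta_avoid: int = 0) -> np.ndarray:
    """Eq. (2): for each adjacent-frame pair (one row of B^2 block
    distances), sort ascending and sum all but the theta_avoid largest."""
    s = np.sort(dists, axis=1)          # D_sim(b') in ascending order
    B2 = dists.shape[1]
    return s[:, :B2 - theta_avoid].sum(axis=1)

def classify_boundaries(x: np.ndarray, iters: int = 50) -> np.ndarray:
    """k-means with k = 2 on the scores; the smaller cluster is
    taken as the cut boundaries, per Section 2.3."""
    c = np.array([x.min(), x.max()], dtype=float)   # initial centroids
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                c[k] = x[labels == k].mean()
    # The cut-boundary cluster is the one with fewer members.
    cut = 0 if (labels == 0).sum() < (labels == 1).sum() else 1
    return labels == cut
```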

3. JUNK ELIMINATION

In order to reduce the computational time for feature extraction, the junk shots need to be identified and eliminated. For this work, we applied the algorithm in [14], which achieved accurate junk-removal results on the TRECVID rushes videos.

4. FEATURE EXTRACTION AND RETAKE DETECTION

4.1 Local Feature Extraction
Let V = {S_1, S_2, …, S_n} be a set of shots, where n is the number of shots. Each shot is represented by a set of key-frames. In order to reduce the computation time, the key-frames were extracted from the original video at every 10th frame. Each frame was then divided into m = B_x × B_y blocks. From our evaluations (data not shown), B_x = 5 and B_y = 5 provided the best result. Let F_{f_t}^k denote the SIFT features of the k-th block of frame f_t. The SIFT features, extracted by the method of [15], are represented by the vector ν_{f_t}^j = [d_{f_t}^j σ_{f_t}^j x_{f_t}^j y_{f_t}^j]^T, where 1 ≤ j ≤ NM/m, N and M are the video frame width and height respectively, d_{f_t}^j = (∈_{f_t}^{j,1}, …, ∈_{f_t}^{j,128}) is the 128-dimensional SIFT descriptor, σ_{f_t}^j is the scale of the SIFT feature, and x_{f_t}^j and y_{f_t}^j are the SIFT feature locations. The similarity between two SIFT features was determined using only the SIFT descriptor d_{f_t}^j [16]. Then, each F_{f_t}^k of each block is defined as in Eq. (4):

F_{f_t}^k = { d_{f_t}^{k,j} | d_{f_t}^{k,j} = (∈_{f_t}^{k,j,1}, …, ∈_{f_t}^{k,j,128}) }   (4)

The F_{f_t}^k of each block was extracted by the method of [15].

In order to select the k-th blocks as the object location representative, SIFT matching [16] can be used to find the similarity between the k-th blocks of frames f_t and f_{t+1}. Thus, the similarity between the SIFT features F_{f_t}^k and F_{f_{t+1}}^k is defined as in Eq. (5):

(5)

where τ is a distance ratio between the first nearest and the second nearest neighbors, and is used to discard incorrect matches. Then, the similarity in the same block between frames f_t and f_{t+1} was evaluated (Algorithm 1). An example is shown in Figure 4.
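The body of Eq. (5) is not legible in this copy, but the text describes the standard nearest-neighbor distance-ratio test with threshold τ. A sketch of that test applied to two sets of block descriptors, under our assumption that the block similarity is a count of ratio-test matches, might look like:

```python
import numpy as np

def ratio_test_matches(desc_a: np.ndarray, desc_b: np.ndarray,
                       tau: float = 0.8) -> int:
    """Count descriptor matches between two blocks using the
    distance-ratio test: a descriptor in desc_a matches desc_b only
    if its nearest neighbour is closer than tau times the distance
    to the second-nearest neighbour."""
    matches = 0
    for d in desc_a:
        dists = np.linalg.norm(desc_b - d, axis=1)
        if len(dists) < 2:
            continue
        first, second = np.partition(dists, 1)[:2]
        if first < tau * second:
            matches += 1
    return matches
```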


To select the k-th blocks as the object location representative, a threshold-based approach was performed as follows. Let θ_select denote the threshold to select the k-th blocks, and let β_{f_t,f_{t+1}}^k be the similarity result of D(F_{f_t}^k, F_{f_{t+1}}^k). The threshold θ_select is then defined as in Eq. (6):

θ_select = (α/m) Σ_{k=1}^{m} β_{f_t,f_{t+1}}^k   (6)

where α is a constant. If the value of β_{f_t,f_{t+1}}^k was over the threshold value θ_select, then the k-th block was selected (see Figure 4).

Figure 4. An example of the SIFT matching result.

Figure 5. A frame is divided into 5 × 5 blocks and a set of English alphabet characters is assigned.

In order to encode the set of selected features into a string sequence, the string representation approach was performed. Let a set of English alphabet characters correspond to the grid blocks (see Figure 5). The sequence of the string was then determined by matching the indices of the selected features (for an example, see Figure 6). Let Z = <z_1, z_2, …, z_m> be the string sequence encoded from frames f_t and f_{t+1} by the proposed method, where z_m is the English alphabet character that corresponds to the k-th block and m = B_x × B_y. Let P_n be the string sequence of shot S_n. Then, the string sequence P_n is given as P_n = Z_1 Z_2 … Z_l, where l = (h − 1) and h is the number of key-frames in shot S_n.

Figure 6. An example of encoding five frames into a string sequence.
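The grid-to-alphabet encoding can be illustrated as follows. The row-major block indexing and the sorting of selected block indices are our assumptions, since the paper does not spell out the block ordering, and the function name is hypothetical.

```python
import string

def encode_selected_blocks(selected: list, grid: int = 5) -> str:
    """Map each selected block index (0-based, row-major on a
    grid x grid layout) to an uppercase letter, producing the
    string Z for one pair of adjacent key-frames."""
    alphabet = string.ascii_uppercase  # 'A'..'Z' covers a 5 x 5 grid
    return "".join(alphabet[k] for k in sorted(selected))
```

The shot representation P_n is then simply the concatenation of these per-pair strings over the shot's key-frames.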

4.2 Retake Detection
As previously mentioned, a rushes video consists of retakes: essentially sets of repetitive shots with the same or a near-identical setting, typically having a common subsequence of specific object patterns. To detect a common subsequence, the sequence of object locations is encoded into a sequence of string patterns by using the object recognition method based on the SIFT feature and the grid method. Since each retake has


two or more takes that appear as a sequence (see Figure 7), retakes can be found by finding a common subsequence in the string pattern based on simple retake detection and LCS algorithms. In order to detect the repeated takes in the same scene, the retake detection algorithm (Algorithm 2) was then implemented.

Figure 7. An example of takes that appear in two scenes.

The threshold θ_retake in Algorithm 2 is defined as in Eq. (7):

θ_retake = ω min(L_i, L_j)   (7)

where L_i is the length of the sequence extracted from Shot_i, L_j is the length of the sequence extracted from Shot_j, and ω is a constant.
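A sketch of the LCS comparison and the θ_retake test of Eq. (7), under the assumption that the shot similarity being thresholded is the LCS length of the two shot strings (the function names are ours, and this is not the paper's Algorithm 2):

```python
def lcs_length(p: str, q: str) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i, a in enumerate(p, 1):
        for j, b in enumerate(q, 1):
            # Extend the LCS on a character match, otherwise carry
            # forward the best of the two shorter prefixes.
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[len(p)][len(q)]

def is_retake(p: str, q: str, omega: float = 0.45) -> bool:
    """Eq. (7): flag two shot sequences as a retake candidate when
    their LCS exceeds theta_retake = omega * min(|p|, |q|)."""
    return lcs_length(p, q) > omega * min(len(p), len(q))
```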

5. EXPERIMENTAL RESULTS

5.1 Data Sets
For shot boundary detection, all evaluations of the SBD were performed on the TRECVID 2004 and 2007 data sets. The videos are in MPEG-1 format with a frame rate of 29.97 fps and a frame size of 352 × 288 pixels. The ground truth provided by TRECVID was used for evaluating the results.

For retake detection, the TRECVID 2007-2008 rushes video summarization data sets [1] were used. Eight of these data sets were selected for testing and evaluation. The video format, frame rate and frame size were as above.

5.2 Performance Evaluation
Performance was characterized as the proportion of all true shot boundaries (retakes) that were identified (Recall), the proportion of detections that were correct (Precision), and, from these, the F1 ratio, as shown in Eq. (8):

Recall = tp / (tp + fn), Precision = tp / (tp + fp), F1 = (2 × Recall × Precision) / (Recall + Precision)   (8)

where tp (true positive) is the number of correctly retrieved shot boundaries or retakes, fn (false negative) is the number of missed shot boundaries or retakes, and fp (false positive) is the number of falsely retrieved shot boundaries or retakes.
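Eq. (8) translates directly into code; a minimal helper (the name is ours) is:

```python
def prf1(tp: int, fp: int, fn: int):
    """Eq. (8): recall, precision and F1 from true positives,
    false positives and false negatives."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```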

5.3 Shot Boundary Detection (SBD)
5.3.1 Comparison for Number of Block Selection
The performance of different numbers of blocks, with respect to providing the best result for SBD, was evaluated using the TRECVID 2007 data set of eight videos. In order to compare the performance with different numbers of blocks, the input videos were divided into B × B blocks, where B = {2, 3, 4, 5, 6, 7, 8, 9, 10}. The local SVD feature was then extracted from each block. The similarity between each two adjacent frames was then determined using the Euclidean distance. The set of input vectors, created using



Eqs. (2) and (3) (section 2.3) with θ_avoid = 0, and k-means with k = 2, was then applied to classify the shot boundary. The results of the performance comparison are shown in Table 1.

From the results (Table 1), block sizes of 8 × 8 and 10 × 10 provided a higher recall value than the other block sizes, whilst block sizes of 2 × 2, 4 × 4 and 5 × 5 provided a higher precision value than the block sizes of 8 × 8 and 10 × 10. Thus, a small number of blocks missed some boundaries because it could not differentiate between shot boundaries that have similar visual contents, whereas a large number of blocks produced false detections since it could not differentiate between hard cuts and the motion of large objects or relatively quick camera movement. Therefore, in this work, a block size of 8 × 8 was chosen since it gave a higher recall and precision value than a 10 × 10 block size.

5.3.2 Comparison of the Threshold Value for Large Similarity Value Removal
From the previous evaluation (section 5.3.1), it was found that a large number of blocks could not differentiate between hard cuts and the motion of a large object or relatively quick camera movement. This can be overcome by removing the blocks with the largest similarity values in the local SVD feature between two adjacent frames. Thus, the performance of different threshold values on an 8 × 8 block layout, in terms of providing the best result for SBD, was evaluated. The local SVD feature was extracted from each block and the similarity between each pair of adjacent frames was determined using the Euclidean distance. The set of input vectors was created using Eqs. (2) and (3) with the threshold parameter θ_avoid varying from 0 to 63. Then, k-means with k = 2 was applied to the set of input vectors to classify the shot boundary.

From the results, a low threshold parameter θ_avoid gave the highest average recall and precision. Because the motion of a large object or quick camera movement can result in very different intensity distributions for the affected blocks, their removal reduces this impact. In contrast, high values of θ_avoid yielded a high precision but a low average recall, since too many shot boundaries were rejected.

Thus, in this work, a θ_avoid value in the range of 5 to 9 was selected as a compromise between the recall and precision performances.

Table 1. Performance comparison for different numbers of blocks in the block selection.

Recall
Video name  2×2   3×3   4×4   5×5   6×6   7×7   8×8   9×9   10×10
BG_2408     0.80  0.95  0.99  0.99  0.99  0.99  1.00  0.99  1.00
BG_37359    0.85  0.90  0.95  0.96  0.96  0.96  0.98  0.97  0.98
BG_35050    0.87  0.93  0.96  0.96  0.97  0.96  0.97  0.96  0.97
BG_36028    0.92  0.92  0.97  0.99  0.99  0.99  0.99  0.99  0.99
BG_37417    0.93  0.97  0.99  0.99  1.00  0.99  1.00  0.99  0.99
BG_35187    0.74  0.79  0.87  0.89  0.95  0.93  0.95  0.95  0.95
BG_36537    0.85  0.90  0.91  0.92  0.92  0.92  0.92  0.93  0.93
BG_37879    0.91  0.94  0.94  0.95  0.98  0.97  0.99  0.99  0.99
Average     0.86  0.91  0.95  0.96  0.97  0.96  0.98  0.97  0.98


Table 1. Continued.

Precision
Video name  2×2   3×3   4×4   5×5   6×6   7×7   8×8   9×9   10×10
BG_2408     0.94  0.94  0.93  0.93  0.96  0.95  0.96  0.96  0.96
BG_37359    0.99  0.97  0.95  0.93  0.90  0.90  0.90  0.88  0.86
BG_35050    0.98  0.97  0.97  0.98  0.97  0.98  0.98  0.98  0.98
BG_36028    0.81  0.81  0.88  0.89  0.87  0.86  0.85  0.86  0.85
BG_37417    0.97  0.95  0.97  0.94  0.92  0.91  0.90  0.90  0.90
BG_35187    0.97  0.99  0.99  0.99  0.98  0.98  0.98  0.97  0.97
BG_36537    0.93  0.90  0.90  0.90  0.89  0.90  0.89  0.88  0.87
BG_37879    0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
Average     0.95  0.94  0.95  0.95  0.94  0.94  0.93  0.93  0.92

F1
Video name  2×2   3×3   4×4   5×5   6×6   7×7   8×8   9×9   10×10
BG_2408     0.86  0.94  0.96  0.96  0.97  0.97  0.98  0.98  0.98
BG_37359    0.91  0.93  0.95  0.94  0.93  0.93  0.94  0.92  0.91
BG_35050    0.92  0.95  0.96  0.97  0.97  0.97  0.97  0.97  0.97
BG_36028    0.86  0.86  0.92  0.94  0.93  0.92  0.91  0.92  0.91
BG_37417    0.95  0.96  0.98  0.96  0.96  0.95  0.95  0.94  0.94
BG_35187    0.84  0.88  0.93  0.94  0.96  0.95  0.96  0.96  0.96
BG_36537    0.89  0.90  0.90  0.91  0.90  0.91  0.90  0.91  0.90
BG_37879    0.95  0.97  0.97  0.97  0.99  0.98  0.99  0.99  0.99
Average     0.90  0.92  0.95  0.95  0.95  0.95  0.95  0.95  0.95

5.3.3 Comparison of Performance
The SVM algorithm of Le [17] was previously found to provide the best performance in cut detection on the TRECVID 2003 data set, and so was selected for comparison with the proposed method to evaluate the SBD performance, using the parameters previously recommended in [17], on the TRECVID 2004 and 2007 data sets of 12 videos. The standard SVM of Le [17] was trained using the four videos from the TRECVID 2004 data set, and then both methods were assessed on the eight videos from the TRECVID 2007 data set. For the proposed method, the input videos were divided into 8 × 8 blocks, the local SVD features were extracted from each block, and the similarity between each adjacent pair of frames was determined using their Euclidean distance. The set of input vectors was created using Eqs. (2) and (3) with θ_avoid = 5. Then, k-means with k = 2 was applied to the set of input vectors to classify the shot boundary.


Table 2. Performance comparison of the proposed method with that of the SVM based method of Le [17].

            Proposed Method            Le [17]
Video name  Recall  Precision  F1     Recall  Precision  F1
BG_2408     0.95    0.98       0.97   0.92    0.95       0.93
BG_37359    0.95    0.93       0.94   0.98    1.00       0.99
BG_35050    0.91    0.99       0.95   0.95    0.92       0.93
BG_36028    0.97    0.97       0.97   0.87    0.79       0.83
BG_37417    0.99    0.97       0.98   0.95    0.74       0.83
BG_35187    0.89    1.00       0.94   0.85    0.91       0.88
BG_36537    0.90    0.94       0.92   0.96    0.99       0.97
BG_37879    0.98    0.98       0.98   0.78    0.99       0.87
Average     0.95    0.97       0.96   0.91    0.91       0.90

From the results (Table 2), the proposed method was found to have a similar recall rate to the method of Le [17], but overall it gave a much better precision rate. For videos BG_36028, BG_37417 and BG_37879, the proposed method had a better recall and precision rate, whilst for videos BG_37359 and BG_36537, the standard SVM method of Le [17] had a better recall and precision rate. For video BG_35187, a low recall rate was seen with the proposed method, but it gave a better precision rate; this is because this video has similar backgrounds and visual content in some adjacent shots. The SVM method of Le [17], on the other hand, had a low recall and precision rate here because it is very sensitive to small changes. For video BG_37879, the method of Le [17] yielded a precision rate similar to the proposed method but a lower recall rate, because it cannot detect an adjacent shot that has a blurred background. Likewise, the false detections by the proposed method revealed that it can be sensitive to changes in the image intensity. Overall, however, the proposed method was found to have a performance close to that of the method of Le [17].

5.4 Retake Detection

5.4.1 Experiment setup

The performance of the proposed method and the existing method of Bailer [10] were compared when the video shot boundaries were extracted both manually and automatically. The method of Bailer [10] was selected as the reference for comparison, using its recommended parameter settings [10], since it provided the best retake detection performance on the TRECVID 2007 rushes video summarization data set. For the manual case, the input videos were divided into segments by manually extracting the shot boundaries, and the set of ground truths was manually identified. For the automatic case, the input videos were divided into segments using the proposed automatic SBD with a block size of 8 × 8 and θ_avoid = 5. Then, k-means with k = 2 was applied to classify the shot boundaries. The procedure for the proposed method was to first extract the key frames from each shot at every 10th frame, divide them into 5 × 5 blocks and extract the SIFT features. The constant α was set to 1.8 (Section 4.1, Eq. (6)). The similarity between consecutive frames was calculated using SIFT matching and converted into a string. The given strings were then concatenated into a string sequence to use as the shot representative, and the similarity between any two sequences was evaluated by the LCS algorithm. The constant ω was set to 0.45 (Section 4.2, Eq. (7)). Algorithm 2 was then performed to detect the retakes.
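As a concrete illustration, the LCS comparison step can be sketched as follows. This is a minimal sketch: the function names are illustrative, the LCS is the standard dynamic-programming formulation, and normalising the match count by the shorter sequence is one plausible choice, not necessarily the paper's exact Eq. (7).

```python
def lcs_length(a: str, b: str) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def shot_similarity(seq_a: str, seq_b: str) -> float:
    """Similarity of two shot string sequences, normalised by the shorter one."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / min(len(seq_a), len(seq_b))

OMEGA = 0.45  # similarity threshold, as set in the experiment

def is_retake(seq_a: str, seq_b: str) -> bool:
    """Two shots are declared retakes when their similarity reaches ω."""
    return shot_similarity(seq_a, seq_b) >= OMEGA
```

With this sketch, a pair of shot sequences is flagged as a retake pair whenever the LCS covers at least 45% of the shorter sequence.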

Table 3. Performance comparison of retake detection with manual shot boundary detection (SBD).

Video name      Proposed Method             Bailer [10]
                Recall  Precision  F1       Recall  Precision  F1
MRS144760       0.83    0.83       0.83     1.00    0.86       0.92
MRS145342       1.00    1.00       1.00     0.20    0.17       0.18
MRS147040       0.78    0.88       0.82     0.90    0.82       0.86
MRS151585       1.00    1.00       1.00     0.83    0.83       0.83
MRS150072       0.78    1.00       0.88     0.89    1.00       0.94
MRS025913       0.86    1.00       0.92     0.63    0.63       0.63
MRS044500       0.80    1.00       0.89     0.80    1.00       0.89
MRS145918       0.83    1.00       0.91     0.67    1.00       0.80
Average         0.86    0.96       0.91     0.74    0.79       0.76

5.4.2 Results of retake detection with the manual SBD

Analysis of the videos MRS151585 and MRS145342 revealed perfect matching results (Table 3), which reflects the fact that in these videos only a few objects (actors) appeared in a scene and, in addition, the motion magnitude was low, which improved the SIFT feature matching result. On the other hand, the method of Bailer [10] yielded low recall and precision levels because that algorithm created the take candidates by pairwise matching of shots and then merging together those that had the same visual content, and so it could not detect these differences. Analysis of video MRS150072 produced a low recall level with the proposed method, probably because the retakes of the same scene had different durations and so contained different amounts of information between takes within the same scene. In this scenario, the differences in the duration of each take produce strings of different lengths and, therefore, the number of matched symbols found by the LCS algorithm was lower. Overall, the proposed method has a better performance in detecting retakes than that of Bailer [10].
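This duration effect can be seen in a toy example, assuming hypothetical per-frame location symbols and normalisation by the longer take (both assumptions made for illustration only):

```python
def lcs_length(a: str, b: str) -> int:
    """Dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# Two takes of the same action, shot with different durations
take_short = "LLCCRR"          # object moves left -> centre -> right
take_long = "LLLLCCCCRRRRRR"   # same action, but the take runs longer

# Normalising by the longer take penalises the duration mismatch:
# the whole short take matches, yet the score is only 6/14 ~ 0.43,
# which falls just below a threshold such as omega = 0.45
similarity = lcs_length(take_short, take_long) / max(len(take_short), len(take_long))
```

Even though the shorter take is entirely contained in the longer one, the mismatch in lengths pulls the score under the threshold, which is the failure mode observed for MRS150072.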

Table 4. Performance comparison of retake detection with automatic shot boundary detection (SBD).

Video name      Proposed Method             Bailer [10]
                Recall  Precision  F1       Recall  Precision  F1
MRS144760       0.83    0.83       0.83     1.00    0.86       0.92
MRS145342       1.00    1.00       1.00     0.20    0.17       0.18
MRS147040       0.78    0.88       0.82     0.90    0.82       0.86
MRS151585       1.00    1.00       1.00     0.83    0.83       0.83
MRS150072       0.78    1.00       0.88     0.89    1.00       0.94
MRS025913       0.63    0.71       0.67     0.38    0.50       0.43
MRS044500       0.80    1.00       0.89     0.80    1.00       0.89
MRS145918       0.83    1.00       0.91     0.67    1.00       0.80
Average         0.83    0.93       0.88     0.71    0.77       0.73

5.4.3 Results of retake detection with automatic SBD

The results of the performance comparison are shown in Table 4. For videos MRS144760, MRS145342, MRS147040, MRS151585, MRS150072, MRS044500 and MRS145918, the proposed method and that of Bailer [10] gave essentially the same results as when the shot boundaries were manually extracted (Section 5.4.2; Table 3) and were close to the ground truth. Analysis of video MRS025913 gave a low accuracy when compared with the results obtained by manual extraction, which is due to the larger difference from the ground truth and the fact that some retakes had no stop code inserted between consecutive takes, so some retakes were not detected. However, the average recall, precision and F1 levels of the proposed method are higher, and so it shows a better performance in detecting retakes than the method of Bailer [10].

6. CONCLUSION AND FUTURE WORK

In this manuscript, a new approach to detect the presence of retakes in rushes videos based on matching string sequences encoded from the locations of objects was presented. The framework of this proposed method was based upon four steps. First, the input video was decomposed into shots using a local SVD and k-means clustering. Second, useless shots, such as color bars, single colors, very short shots and clapper boards, were removed. Third, the key frames were extracted from each shot at every 10th frame and encoded into a string sequence using the location of the object based on the SIFT features. Finally, the LCS and a simple algorithm were executed to detect the presence of any retakes. From the experimental results, the proposed method has a better performance in detecting retakes than the existing retake detection algorithms. Future work includes adding more features, such as object trajectories and motion information, in order to solve the remaining problems and produce a higher accuracy result.
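The first of the steps above, classifying candidate boundaries with k-means (k = 2), can be sketched in one dimension. In this sketch, scalar frame-difference scores stand in for the local SVD features used by the actual method, and all function names are illustrative:

```python
def kmeans_two(values, iters=20):
    """Plain k-means with k = 2 on scalar scores, seeded at the extremes."""
    centroids = [min(values), max(values)]
    for _ in range(iters):
        clusters = ([], [])
        for v in values:
            # assign each score to its nearest centroid (False -> 0, True -> 1)
            clusters[abs(v - centroids[0]) > abs(v - centroids[1])].append(v)
        centroids = [sum(c) / len(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids

def detect_boundaries(diff_scores):
    """Frames whose score falls in the higher-centroid cluster are cuts."""
    low, high = sorted(kmeans_two(diff_scores))
    return [i for i, v in enumerate(diff_scores)
            if abs(v - high) <= abs(v - low)]

scores = [0.10, 0.12, 0.90, 0.11, 0.08, 0.85, 0.10]
print(detect_boundaries(scores))  # frames 2 and 5 are classified as cuts
```

The appeal of this design is that no fixed threshold on the difference score is needed; the two clusters adapt to each video's own score distribution.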

ACKNOWLEDGEMENTS

This research was supported by the Thailand Research Fund (TRF) and the Commission on Higher Education (CHE). The rushes video files are copyrighted and are provided for research purposes through the TREC Information Retrieval Research Collection, with thanks.

REFERENCES

[1] Over P., Smeaton A. and Awad G., Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (TVS '08), 2008; 1-20. DOI 10.1145/1463563.1463564.

[2] Lijin Z., Int. J. Adv. Comput. Technol., 2011; 3(5): 161-169.

[3] Li Y., Jin J.S. and Zhou X., Proceedings of the 2005 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2005), 2005; 317-320. DOI 10.1109/ISPACS.2005.1595410.

[4] Pan C.M., Chuang Y.Y. and Hsu W.H., Proceedings of the International Workshop on TRECVID Video Summarization (TVS '07), 2007; 74-78. DOI 10.1145/1290031.1290045.

[5] Truong B.T. and Venkatesh S., Proceedings of the International Workshop on TRECVID Video Summarization (TVS '07), 2007; 30-34. DOI 10.1145/1290031.1290036.

[6] Wang F. and Ngo C.W., Proceedings of the International Workshop on TRECVID Video Summarization (TVS '07), 2007; 25-29. DOI 10.1145/1290031.1290035.

[7] Liu Z., Zavesky E., Shahraray B., Gibbon D. and Basso A., Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (TVS '08), 2008; 21-25. DOI 10.1145/1463563.1463565.

[8] Bai L., Hu Y., Lao S., Smeaton A.F. and O'Connor N.E., Multimedia Tools and Applications, 2009; 49(1): 63-80. DOI 10.1007/978-3-540-92235-3_3.

[9] Emilie D. and Bernard M., Multimedia Tools and Applications, 2010; 48(1): 51-68. DOI 10.1007/s11042-009-0374-9.

[10] Bailer W., Lee F. and Thallinger G., The Visual Computer, 2009; 25(1): 53-68. DOI 10.1007/s00371-008-0280-6.

[11] Bailer W., Proceedings of the 19th International Conference on Database and Expert Systems Applications (DEXA '08), 2008; 595-599. DOI 10.1109/DEXA.2008.26.

[12] Jeong K.M., Lee J.J. and Ha Y.H., Proceedings of the International Conference on Image Analysis and Recognition (ICIAR 2006), 2006; 426-435. DOI 10.1007/11867586_40.

[13] Boreczky J.S. and Rowe L.A., J. Electronic Imaging, 1996; 5(2): 122-128. DOI 10.1117/12.238675.

[14] Putpuek N., Le D.D., Cooharojananone N., Satoh S. and Lursinsap C., Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (TVS '08), 2008; 100-104. DOI 10.1145/1463563.1463581.

[15] Lowe D.G., Int. J. Comp. Vision, 2004; 60(2): 91-110. DOI 10.1023/B:VISI.0000029664.99615.94.

[16] Fan Q., Barnard K., Amir A., Efrat A. and Lin M., Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval (MIR '06), 2006; 239-247. DOI 10.1145/1178677.1178710.

[17] Le D.D., Satoh S., Ngo T.D. and Duong D.A., Proceedings of the 2008 IEEE 10th Workshop on Multimedia Signal Processing, 2008; 702-706. DOI 10.1109/MMSP.2008.4665166.

[18] Cernekova Z., Kotropoulos C. and Pitas I., SPIE J. Electronic Imaging, 2007; 16(4): 51-59. DOI 10.1117/1.2812528.

