Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2017

Feature extraction for image selection using machine learning

Matilda Lorentzon

Master of Science Thesis in Electrical Engineering
Feature extraction for image selection using machine learning

Matilda Lorentzon
LiTH-ISY-EX--17/5097--SE

Supervisor: Marcus Wallenberg, ISY, Linköping University

Tina Erlandsson, Saab Aeronautics

Examiner: Lasse Alfredsson, ISY, Linköping University

Computer Vision Laboratory, Department of Electrical Engineering

Linköping University, SE-581 83 Linköping, Sweden

Copyright © 2017 Matilda Lorentzon

Abstract

During flights with manned or unmanned aircraft, continuous recording can result in a very high number of images to analyze and evaluate. To simplify image analysis and to minimize data link usage, appropriate images should be suggested for transfer and further analysis. This thesis investigates features used for selection of images worthy of further analysis using machine learning. The selection is done based on the criteria of having good quality, salient content and being unique compared to the other selected images. The investigation is approached by implementing two binary classifications, one regarding content and one regarding quality. The classifications are made using support vector machines. For each of the classifications three feature extraction methods are performed and the results are compared against each other. The feature extraction methods used are histograms of oriented gradients, features from the discrete cosine transform domain and features extracted from a pre-trained convolutional neural network. The images classified as both good and salient are then clustered based on similarity measures retrieved using color coherence vectors. One image from each cluster is retrieved and those are the resulting images from the image selection. The performance of the selection is evaluated using the measures precision, recall and accuracy. The investigation showed that using features extracted from the discrete cosine transform provided the best results for the quality classification. For the content classification, features extracted from a convolutional neural network provided the best results. The similarity retrieval showed to be the weakest part, and the entire system together provides an average accuracy of 83.99%.


Acknowledgments

First of all, I would like to thank my supervisor Marcus Wallenberg at ISY for expertise and support throughout the thesis work. I would also like to thank my examiner Lasse Alfredsson at ISY for valuable feedback. Also thanks to my supervisor Tina Erlandsson for the opportunity to do my thesis work at Saab Aeronautics, as well as for showing great interest in my work.

Last but not least, I would like to thank my family and friends for love, support and coffee breaks.

Linköping, 2017
Matilda Lorentzon


Contents

Notation

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Limitations

2 Related theory
2.1 Available data
2.2 Machine learning
2.3 Support Vector Machines
2.4 Histogram of oriented gradients
2.5 Features extracted from the discrete cosine transform domain
2.6 Features extracted from a convolutional neural network
2.6.1 Convolutional neural networks
2.6.2 Extracting features from a pre-trained network
2.7 Color coherence vector

3 Method
3.1 Feature extraction
3.2 Predictor
3.3 Similarity retrieval
3.4 Evaluation
3.5 Generation of training and evaluation data

4 Results
4.1 Quality classification
4.2 Content classification
4.3 Similarity retrieval
4.4 The entire system

5 Discussion
5.1 Results
5.1.1 Quality classification
5.1.2 Content classification
5.1.3 Similarity retrieval part
5.1.4 The entire system
5.2 Method
5.3 Possible improvements

6 Conclusions

Bibliography

Notation

Abbreviations

Abbreviation   Meaning
DCT            Discrete cosine transform
SVM            Support vector machines
HOG            Histogram of oriented gradients
RGB            Red, green, blue
SSIM           Structural similarity
ROC            Receiver operating characteristic


1 Introduction

1.1 Motivation

The collection of image data is increasing rapidly for many organisations within the fields of, for example, military, law enforcement and medical science. As sensors and mass storage devices become more capable and less expensive, the data collection increases and the databases being accumulated grow larger, eventually making it impossible for analysts to screen all of the data collected in a reasonable time. This is why computer assistance becomes increasingly important, and when searching by meta-data is impractical the only solution is to search by image content [5].

During flights with manned or unmanned aircraft, continuous recording can result in a very high number of images to analyze and evaluate. The images are assumed to be evaluated by automatic target recognition functions as well as image analysts on the ground, and also by pilots during missions. The images may contain interesting objects like vehicles, buildings or people, but most contain nothing of interest for the reconnaissance mission. A single target can often be found in multiple images which are similar to each other. The images can also be of different interpretation quality, meaning that properties like different lighting conditions and blur affect the user's ability to interpret the image content. To simplify image analysis and to minimize data link usage, appropriate images are suggested for transfer and analysis.

1.2 Aim

The aim of the master's thesis is to investigate which features in images can be used to select images worthy of further analysis. This is done by implementing two classifications: one regarding quality and one regarding content. In the first classification, images will be binarily classified as either good or bad depending on the image quality. In this report, good and bad refer to the two quality classes. The images classified as good will



continue to the next classification, where they will be binarily classified as either salient or non-salient depending on the image content. In this report, salient and non-salient refer to the two content classes. The images classified as salient will continue to the next step, where the final retrieval will be done depending on similarity measures. In the case where there is a set of images that are almost identical, the image with the highest certainty of being good and salient will be retrieved. What is interesting content in an image depends on the use case and data set.

The master's thesis will answer the following questions:

• Can any of the provided feature extraction methods produce features useful for differentiating between good and bad quality images?

• Can any of the provided feature extraction methods produce features useful for differentiating between salient and non-salient content in images?

• Is it possible to make a good image selection using machine learning classifications based on both image content and quality, followed by a retrieval based on similarity measures?

1.3 Limitations

The investigation is limited to an example data set which is modified to fit the task. Bad quality images are limited to the distortion types described in section 3.5, which are added to the images. Similar images are retrieved synthetically from one image. The investigation is limited to only using one classification model for all classifications. The classifications and retrievals are done using one salient class at a time.

2 Related theory

This chapter covers the related theory which supports the methods used in this thesis. Unless anything else is specified, the content of a paragraph is supported in the references specified at the end of the paragraph, without case specific modifications.

2.1 Available data

The data used is the COCO (Common Objects in Context) [10] data set, which contains 91 different object categories such as food, animals and vehicles. It contains many non-iconic images of the objects in their natural environment, as opposed to iconic images which typically have a large object in a canonical perspective centered in the image. Non-iconic images contain more contextual information and the object in non-canonical perspectives. Figure 2.1 shows examples of iconic and non-iconic images from the COCO data set.

Figure 2.1: Examples of images from the data set containing the object cat. (a) is an iconic image, while (b) and (c) are non-iconic.



2.2 Machine learning

Machine learning is the concept of learning from large sets of existing data to make predictions about new data. It is based on creating models from observations, called training data, for data-driven decision making. The concept is illustrated by a flow chart in figure 2.2, where the vertical part of the flow is called the training part and the horizontal part is called the evaluation part [18].

Figure 2.2: The concept of machine learning, where a machine learning algorithm creates a decision model from training data. The model is then used to make predictions about new data. (Flow chart drawn according to [18])

There are different types of machine learning models; this report focuses on the one called supervised learning. In supervised learning the input training data have corresponding outputs, and the goal is to find a function or model that correctly maps the inputs to the outputs. That is in contrast to unsupervised learning, for which the input data has no corresponding output. The goal of unsupervised learning is to model the underlying structure or distribution of the input data to create corresponding outputs [18]. A common use of supervised machine learning is classification, where the observations are labelled with classes and the prediction outputs are different classes. It can be described in a simple manner as finding the function f that fulfills Y = f(X), where X contains the input observations and Y the corresponding output classes. With X and Y as matrices the description becomes as follows:


\begin{bmatrix} \text{class}(\text{observation}_1) \\ \text{class}(\text{observation}_2) \\ \vdots \end{bmatrix} = f\left(\begin{bmatrix} \text{observation}_1 \\ \text{observation}_2 \\ \vdots \end{bmatrix}\right) \qquad (2.1)

Y is a column vector where each row contains the class of the corresponding row in X. Each row in X corresponds to an observation, which is represented by the values, also called features, in its columns. These values can be measurements such as weight and height, but when it comes to images the compilation of the values in X becomes more complex [14]. Raw pixel values can be used as features for images, but for other than simple cases the representation is not descriptive enough, especially when working with natural images. The aim is to represent an image by distinctive attributes that separate the observations of one class from the other. Therefore an important step when using machine learning on images is feature extraction [7]. In figure 2.2 the feature extraction is a big part of the first step in both the training part and the evaluation part. There are many methods for feature extraction; this thesis covers three of them: histogram of oriented gradients in section 2.4, features extracted from the discrete cosine domain in section 2.5, and features extracted from a pre-trained convolutional neural network in section 2.6.

2.3 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model. By learning from provided examples (the training data) the model finds a function that couples input data to the correct output. The output for novel data can then be predicted by applying the retrieved function. SVM is often used for classification problems, for which the correct output is the class the data belongs to. The model works by creating a hyperplane that separates data points from one class from those from the other class, with a margin as high as possible. The margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors, which give the model its name, are the data points closest to the hyperplane and therefore determine the margin. The margin and the support vectors are illustrated in figure 2.3.


Figure 2.3: Illustration of the hyperplane separating data points from two classes, shown as + and -. The support vectors and the margin are marked. Figure drawn according to [11].

The data might not allow for a separating hyperplane; in that case a soft margin can be used, which means that the hyperplane separates many but not all data points. The data for training is a set of vectors x_j along with their classes y_j, where j is a training instance, j = 1, 2, ..., l, and l is the number of training instances. The hyperplane can be created in a higher dimensional space if separating the classes requires it. The hyperplane is described by w^T ϕ(x_j) + w_0 = 0, where ϕ is a function that maps x_j to a higher-dimensional space and w is the normal to the hyperplane. The SVM classifier satisfies the following conditions:

\begin{cases} w^T\phi(x_j) + w_0 \geq +1 & \text{if } y_j = +1 \\ w^T\phi(x_j) + w_0 \leq -1 & \text{if } y_j = -1 \end{cases} \quad j = 1, 2, \ldots, l \qquad (2.2)

and classifies according to the following decision function

y(x) = \text{sign}\left[ w^T\phi(x) + w_0 \right] \qquad (2.3)

where ϕ non-linearly maps x to the high-dimensional feature space. A linear separation is then performed in the feature space, which is illustrated in figure 2.4.


Figure 2.4: Illustration of the non-linear mapping of ϕ from the input space to the high-dimensional feature space. The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space, but the resulting feature space can be of higher dimensions. In both spaces the data points of different classes, shown as + and -, are on different sides of the hyperplane, but in the high-dimensional space they are linearly separable. Figure drawn according to [2].

If the feature space is high-dimensional, performing computations in that space is computationally heavy. Therefore a kernel function is introduced, which is used to map the original non-linear observations into higher dimensional space more efficiently. The kernel function can be expressed as a dot product in a high-dimensional space. Through the kernel function all computations are performed in the low-dimensional input space. The kernel function is

K(x, x') = \phi(x)^T \phi(x') \qquad (2.4)

which is equal to the inner product of the two vectors x and x' in the feature space. Using kernels, a new non-linear decision function is retrieved:

y(x) = \text{sign}\left[ \sum_{j=1}^{l} y_j K(x, x_j) + w_0 \right] \qquad (2.5)

which corresponds to the form of the hyperplane in the input space [2], [11].
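To make the classification step concrete, the following is a minimal sketch of a soft-margin SVM with an RBF kernel using scikit-learn; it is not the thesis implementation (which uses MATLAB), and the toy data, feature dimensions and parameter values are illustrative assumptions.

```python
# Minimal sketch, assuming illustrative toy data: a kernel SVM as in equations (2.2)-(2.5).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))                        # 200 observations, 8 features each
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)  # toy labels that are not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma="scale")              # C controls the soft margin
clf.fit(X_train, y_train)

X_new = rng.normal(size=(5, 8))
print(clf.predict(X_new))                                  # predicted classes
print(clf.decision_function(X_new))                        # signed distances to the hyperplane
```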

2.4 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method for machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap, and each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in


more than one block. For each pixel in each cell the gradients in the x and y directions (Gx and Gy) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

Figure 2.5: An image and its gradient representations in the x and y directions: (a) original image, (b) gradient in the x direction Gx, (c) gradient in the y direction Gy.

The magnitude and phase of the gradients are then calculated according to

r = \sqrt{G_x^2 + G_y^2} \qquad (2.6)

\theta = \arctan\left(\frac{G_y}{G_x}\right) \qquad (2.7)

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°–180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright


to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20] [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

Figure 2.6: The histograms of each cell in the image are visualized using rose plots: (a) image with rose plots, (b) zoomed in. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°–180°, which makes the rose plots symmetric [12].
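As an illustration, a HOG descriptor of the kind described above can be computed with scikit-image; this is a hedged sketch with assumed cell and block sizes, not the exact configuration used in the thesis.

```python
# Minimal sketch, assuming illustrative parameter values: HOG features with scikit-image.
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())
features, hog_image = hog(
    image,
    orientations=9,              # histogram bins spread between 0 and 180 degrees (unsigned gradients)
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),      # blocks of 2x2 cells; neighbouring blocks overlap
    block_norm="L2-Hys",         # per-block normalization of the cell histograms
    visualize=True,
)
print(features.shape)            # the concatenated histograms form the feature vector
```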

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn} \cos\left(\frac{\pi(2m+1)p}{2M}\right) \cos\left(\frac{\pi(2n+1)q}{2N}\right) \qquad (2.8)

where 0 \leq p \leq M-1, 0 \leq q \leq N-1,

\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \leq p \leq M-1 \end{cases} \qquad (2.9)

and


\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \leq q \leq N-1 \end{cases} \qquad (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
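For reference, the transform in equation (2.8) corresponds to the orthonormal 2D DCT-II, which can be computed for an image patch as in the following sketch; the patch content is an arbitrary example rather than data from the thesis.

```python
# Minimal sketch: the 2D DCT of equation (2.8) via SciPy's orthonormal DCT-II.
import numpy as np
from scipy.fft import dctn

patch = np.random.rand(8, 8)            # an 8 x 8 image patch I_mn
B = dctn(patch, type=2, norm="ortho")   # coefficients B_pq
print(B[0, 0])                          # the DC coefficient
print(np.abs(B).round(2))               # most of the energy concentrates in few low-frequency coefficients
```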

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19], which represent image structure and whose statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of DCT image coefficients.

f(x|\alpha, \beta, \gamma) = \alpha \exp\left(-(\beta |x - \mu|)^{\gamma}\right) \qquad (2.11)

where x is the multivariate random variable, μ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

\alpha = \frac{\beta\gamma}{2\Gamma(1/\gamma)} \qquad (2.12)

\beta = \frac{1}{\sigma}\sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}} \qquad (2.13)

where σ is the standard deviation and Γ is the gamma function given by

\Gamma(z) = \int_{0}^{\infty} t^{z-1} \exp(-t)\, dt \qquad (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shadings in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shadings in figure 2.7c.


Figure 2.7: Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks: (a) a 5 × 5 block in an image, on which the parameters γ and ζ are calculated; (b) a 5 × 5 block split into radial frequency sub-bands a, on which Ra is calculated; (c) a 5 × 5 block split into oriented sub-bands b, on which ζb is calculated. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8: Generalized Gaussian distribution for different values of γ.

The parameter γ is retrieved by inserting values in the range 0.3–10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ:

\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)}} - 1 \qquad (2.15)

where X is a random variable representing the histogrammed DCT coefficients, σ|X| and μ|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = \sigma_a^2 \qquad (2.16)

The average energy up to band n is described by

E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j \qquad (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking σ_a^2 from the fit. Using the two parameters E_a and E_{j<a}, a ratio R_a between the components and the sum of the components is computed according to

R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios R_1 and R_2 is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature ζ, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζ_b is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζ_b from all the blocks in an image is calculated. ζ_b and the variance of ζ_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the highest 10th percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is represented by fewer values, as if it was a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
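A hedged sketch of the per-block model fitting is given below, using SciPy's generalized normal distribution for the fit in equation (2.11) and deriving ζ from equation (2.15); the block size and the exclusion of the DC coefficient are assumptions made for illustration, not the thesis settings.

```python
# Minimal sketch, with assumed block size: fit the generalized Gaussian model to the
# DCT coefficients of one block and derive the shape parameter gamma and the
# frequency variation coefficient zeta.
import numpy as np
from scipy.fft import dctn
from scipy.stats import gennorm
from scipy.special import gamma as gamma_fn

block = np.random.rand(17, 17)                  # one block of the image (size assumed)
coeffs = dctn(block, norm="ortho").ravel()[1:]  # DCT coefficients, DC term excluded (assumption)

shape, loc, scale = gennorm.fit(coeffs)         # 'shape' plays the role of gamma in equation (2.11)
zeta = np.sqrt(gamma_fn(1 / shape) * gamma_fn(3 / shape) / gamma_fn(2 / shape) ** 2) - 1  # equation (2.15)
print(shape, zeta)                              # pooled over all blocks to form the image features
```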

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

Convolutional neural networks (CNN) constitute a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value, which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, which is shown in figure 2.10.


Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].
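The layer structure can be made concrete with a small network definition. The following PyTorch sketch mirrors the structure in figure 2.10 with assumed filter counts and sizes; it is only an illustration and not any network used in the thesis.

```python
# Minimal sketch, with assumed layer sizes: two convolutional layers, each followed
# by a sub-sampling (pooling) layer, and two fully connected layers.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(           # feature extraction layers
            nn.Conv2d(3, 8, kernel_size=5),      # filter bank layer on the W x H x 3 RGB input
            nn.Sigmoid(),                        # non-linearity layer
            nn.MaxPool2d(2),                     # sub-sampling layer
            nn.Conv2d(8, 16, kernel_size=5),
            nn.Sigmoid(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(         # the fully connected layers act as the classifier
            nn.Flatten(),
            nn.Linear(16 * 29 * 29, 64),
            nn.Sigmoid(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

scores = SmallCNN()(torch.randn(1, 3, 128, 128))  # one 128 x 128 RGB image
print(scores.shape)                               # per-class scores; soft-max gives class probabilities
```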

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results, which outperform many existing methods and cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. These features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
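As a hedged illustration of the idea (using PyTorch/torchvision and a VGG-16 model rather than the MatConvNet VGG-F implementation used in the thesis), the 1000 soft-max outputs of a pre-trained ImageNet classifier can be used as a generic image descriptor:

```python
# Minimal sketch, assuming torchvision instead of MatConvNet: extract the 1000
# soft-max values of a pre-trained ImageNet classifier as an image descriptor.
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
image = torch.randn(1, 3, 224, 224)                # stand-in for a preprocessed 224 x 224 RGB image
with torch.no_grad():
    descriptor = torch.softmax(model(image), dim=1)  # 1 x 1000 soft-max feature vector
print(descriptor.shape)
```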


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

< (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) >, \quad j = 1, 2, \ldots, n

where α_j is the number of coherent pixels, β_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I' is then given by the following parameters:

\text{differentiating pixels} = \sum_{j=1}^{n} |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \qquad (2.19)

\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (2.20)

[17]
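A hedged sketch of computing color coherence vectors and the similarity of equations (2.19)-(2.20) is shown below; the color quantization, threshold and connectivity are illustrative assumptions, not the exact choices in [17] or in the thesis.

```python
# Minimal sketch, with assumed quantization and threshold: color coherence vectors
# and the similarity measure of equations (2.19)-(2.20).
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors, tau):
    alpha = np.zeros(n_colors, dtype=int)            # coherent pixel counts per color
    beta = np.zeros(n_colors, dtype=int)             # incoherent pixel counts per color
    for c in range(n_colors):
        regions, n_regions = label(indexed == c)     # contiguous regions of color c
        for r in range(1, n_regions + 1):
            size = int(np.sum(regions == r))
            if size > tau:
                alpha[c] += size
            else:
                beta[c] += size
    return alpha, beta

def similarity(ccv1, ccv2, n_pixels):
    diff = np.sum(np.abs(ccv1[0] - ccv2[0]) + np.abs(ccv1[1] - ccv2[1]))  # equation (2.19)
    return 1 - diff / (n_pixels * 2)                                       # equation (2.20)

img_a = np.random.randint(0, 8, (100, 100))          # toy indexed images with 8 colors
img_b = np.random.randint(0, 8, (100, 100))
ccv_a = color_coherence_vector(img_a, 8, tau=100)
ccv_b = color_coherence_vector(img_b, 8, tau=100)
print(similarity(ccv_a, ccv_b, img_a.size))
```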

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.



Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part, and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21], and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons for why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality


parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
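A hedged sketch of this clustering and retrieval step is given below; it assumes a symmetric similarity matrix and a greedy assignment order, details the thesis does not specify.

```python
# Minimal sketch, assuming a symmetric similarity matrix: cluster images whose
# average similarity to a cluster exceeds the threshold (0.87), then keep the
# image with the highest combined classification score from each cluster.
import numpy as np

def cluster_by_similarity(S, threshold=0.87):
    clusters = []
    for i in range(S.shape[0]):
        for cluster in clusters:
            if np.mean([S[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])                 # start a new cluster
    return clusters

def retrieve_one_per_cluster(clusters, scores):
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]

S = np.array([[1.00, 0.91, 0.32],
              [0.91, 1.00, 0.33],
              [0.32, 0.33, 1.00]])
clusters = cluster_by_similarity(S)              # [[0, 1], [2]]
print(retrieve_one_per_cluster(clusters, scores=[0.2, 0.9, 0.5]))   # [1, 2]
```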


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way, true/false negatives/positives are achieved.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (3.1)

which describes how many of the retrieved images are images that should be retrieved,

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \qquad (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) parts of a quantity of images, and how they define the measures is illustrated in (b) precision, (c) recall and (d) accuracy.
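The three measures can be computed directly from counts of true/false positives/negatives, as in the following minimal sketch (the labels are an arbitrary example, not data from the thesis).

```python
# Minimal sketch: precision, recall and accuracy as in equations (3.1)-(3.3).
import numpy as np

def precision_recall_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    tp = np.sum(y_pred & y_true)                 # true positives
    fp = np.sum(y_pred & ~y_true)                # false positives
    fn = np.sum(~y_pred & y_true)                # false negatives
    tn = np.sum(~y_pred & ~y_true)               # true negatives
    precision = tp / (tp + fp)                   # equation (3.1)
    recall = tp / (tp + fn)                      # equation (3.2)
    accuracy = (tp + tn) / len(y_true)           # equation (3.3)
    return precision, recall, accuracy

print(precision_recall_accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (0.667, 0.667, 0.6)
```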

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper-noise,


adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM ≈ 0.65 are shown in figure 3.3.


Figure 3.3: An image and examples of degraded versions of it. The original is seen in (a) and the degraded versions are seen in (b) brightened and Gaussian blurred, (c) motion blurred, and (d) darkened with added salt and pepper noise. The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.
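The labelling by structural similarity can be illustrated with scikit-image; the degradation below (a Gaussian blur) and its strength are assumptions made for the example, not the thesis's randomized modification pipeline.

```python
# Minimal sketch, with an assumed degradation: label an image as good or bad by
# its SSIM value against the original, using the 0.65 threshold.
import numpy as np
from skimage import data, img_as_float
from skimage.filters import gaussian
from skimage.metrics import structural_similarity

original = img_as_float(data.camera())
degraded = gaussian(original, sigma=3)                     # example degradation

ssim_value = structural_similarity(original, degraded, data_range=1.0)
quality_class = "good" if ssim_value > 0.65 else "bad"
print(ssim_value, quality_class)
```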

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them having


an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and therefore creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images   Salient object
0.6951                      cat
0.7288                      airplane
0.6935                      umbrella
0.6821                      handbag
0.6902                      motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.8399      0.939    0.8332     cat
HOG                              0.8544      0.9799   0.8636     airplane
HOG                              0.8018      0.9702   0.813      umbrella
HOG                              0.8333      0.9442   0.8332     handbag
HOG                              0.8506      0.9236   0.8353     motorbike
HOG                              0.8360      0.9514   0.8357     average
Extracted from the DCT domain    0.9196      0.9116   0.8832     cat
Extracted from the DCT domain    0.9292      0.9500   0.9109     airplane
Extracted from the DCT domain    0.9348      0.9444   0.9158     umbrella
Extracted from the DCT domain    0.9348      0.9251   0.9049     handbag
Extracted from the DCT domain    0.9308      0.9425   0.9120     motorbike
Extracted from the DCT domain    0.9298      0.9347   0.9054     average
Features extracted from a CNN    0.6951      1        0.6951     cat
Features extracted from a CNN    0.7288      1        0.7288     airplane
Features extracted from a CNN    0.6935      1        0.6935     umbrella
Features extracted from a CNN    0.6821      1        0.6821     handbag
Features extracted from a CNN    0.6902      1        0.6902     motorbike
Features extracted from a CNN    0.6979      1        0.6979     average


Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation. 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50%, and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.6631      0.6717   0.6652     cat
HOG                              0.8645      0.8043   0.8391     airplane
HOG                              0.5959      0.5739   0.5924     umbrella
HOG                              0.6759      0.6348   0.6652     handbag
HOG                              0.5758      0.7348   0.5967     motorbike
HOG                              0.6750      0.6839   0.6717     average
Extracted from the DCT domain    0.6253      0.6239   0.6250     cat
Extracted from the DCT domain    0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain    0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain    0.6256      0.5630   0.613      handbag
Extracted from the DCT domain    0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain    0.6559      0.6370   0.6441     average
Features extracted from a CNN    0.9038      0.7761   0.8467     cat
Features extracted from a CNN    1           0.6935   0.8467     airplane
Features extracted from a CNN    0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN    0.7560      0.6804   0.7304     handbag
Features extracted from a CNN    0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN    0.8799      0.7635   0.8256     average


Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


Figure 4.4: Color coherence vectors of the images in figure 4.3: (a) for image 4.3a, (b) for image 4.3b and (c) for image 4.3c. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.7782      0.9421   0.7806     cat
0.8071      0.8471   0.7611     airplane
0.7698      0.8843   0.7444     umbrella
0.7537      0.8471   0.7111     handbag
0.7935      0.9050   0.7778     motorbike
0.7805      0.8851   0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.5944      0.6813   0.8543     cat
0.6890      0.5117   0.8663     airplane
0.5055      0.6696   0.8168     umbrella
0.4717      0.5117   0.8027     handbag
0.6169      0.6404   0.8592     motorbike
0.5755      0.6029   0.8399     average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it often is desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found
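To spell out this reasoning with a short derivation: assuming, as observed, that the classifier places every image in the good class so that there are no negatives at all, the definitions in equations 3.1-3.3 give

\[
FN = TN = 0 \;\Rightarrow\;
\text{Recall} = \frac{TP}{TP+FN} = 1, \quad
\text{Precision} = \frac{TP}{TP+FP} = \frac{\#\text{good}}{\#\text{all}}, \quad
\text{Accuracy} = \frac{TP+TN}{\#\text{all}} = \frac{\#\text{good}}{\#\text{all}}
\]

that is, both precision and accuracy reduce to the proportion of good images, which matches the observed values.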

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
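As an illustration of how such a threshold change could be evaluated, the sketch below recomputes the two rates for an arbitrary threshold. The score and label arrays are assumed to be available from the classification step; the names used are placeholders, not part of the actual implementation.

import numpy as np

def rates_at_threshold(scores, labels, threshold):
    # scores: classifier scores for the good class, labels: ground truth (1 = good, 0 = bad)
    predicted_good = scores >= threshold
    tp = np.sum(predicted_good & (labels == 1))
    fp = np.sum(predicted_good & (labels == 0))
    fn = np.sum(~predicted_good & (labels == 1))
    tn = np.sum(~predicted_good & (labels == 0))
    return tp / (tp + fn), fp / (fp + tn)  # (true positive rate, false positive rate)

# A lower (more tolerant) threshold keeps more of the good images, i.e. a higher
# true positive rate, at the cost of letting more bad images pass, i.e. a higher
# false positive rate.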


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances of 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general but varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images into the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The poor performance of features extracted from the DCT domain in the content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, which describe statistics in images associated with quality. The decent result for the category airplane is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and usually performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, the classification threshold can be adjusted if one type of error is preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while the salient object is still present, does not cause as big a change in the color coherence vector as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
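How sensitive the retrieval is to this choice can be examined with a sketch like the one below, which counts how many clusters, and therefore how many retrieved images, a given threshold produces. The pairwise similarity matrix is assumed to be precomputed from the color coherence vectors as in section 3.3; the function and variable names are placeholders.

import numpy as np

def count_retrieved(similarity, threshold):
    # similarity: symmetric matrix of pairwise similarity measures in [0, 1]
    clusters = []
    for i in range(similarity.shape[0]):
        for cluster in clusters:
            # average similarity between image i and the images already in the cluster
            if np.mean([similarity[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return len(clusters)  # one image is retrieved per cluster

# Example: compare a stricter and a more tolerant threshold against the 87% used here.
# for t in (0.80, 0.87, 0.95):
#     print(t, count_retrieved(similarity_matrix, t))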

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the categories in the content classification part, and for all categories in the similarity retrieval part, evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that there is one more error source in the similarity retrieval part when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified are counted as correct. In many cases an image retrieved from a cluster is modified, has SSIM slightly lower than 1, and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough with SSIM to give an image with, for example, SSIM = 0.99 a lower quality score than an image with SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead the performance varies greatly with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary greatly, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies greatly and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. By taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
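As an example of such a weighting, a sketch of a context-weighted version of the color coherence comparison from section 2.7 is given below. The arrays of coherent and incoherent pixel counts per indexed color follow that section, while the set of expected context colors and the weight value are hypothetical assumptions, not something established in this work.

import numpy as np

def weighted_ccv_difference(alpha1, beta1, alpha2, beta2, context_colors, context_weight=0.2):
    # alpha*: coherent pixel counts per indexed color (numpy arrays)
    # beta*:  incoherent pixel counts per indexed color (numpy arrays)
    weights = np.ones(len(alpha1))
    weights[list(context_colors)] = context_weight  # down-weight colors expected in the background
    return np.sum(weights * (np.abs(alpha1 - alpha2) + np.abs(beta1 - beta2)))

A lower weighted difference then translates into a higher similarity, analogously to equation 2.20, but with background colors contributing less to the measure.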

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs greatly from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example which context is prevailing during a recording and where in the image a potential salient object is located.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 2: Feature extraction for image selection using machine learning

Master of Science Thesis in Electrical EngineeringFeature extraction for image selection using machine learning

Matilda LorentzonLiTH-ISY-EX--175097--SE

Supervisor Marcus WallenbergISY Linkoumlping University

Tina ErlandssonSaab Aeronautics

Examiner Lasse AlfredssonISY Linkoumlping University

Computer Vision LaboratoryDepartment of Electrical Engineering

Linkoumlping UniversitySE-581 83 Linkoumlping Sweden

Copyright copy 2017 Matilda Lorentzon

Abstract

During flights with manned or unmanned aircraft continuous recording can result in avery high number of images to analyze and evaluate To simplify image analysis and tominimize data link usage appropriate images should be suggested for transfer and furtheranalysis This thesis investigates features used for selection of images worthy of furtheranalysis using machine learning The selection is done based on the criteria of havinggood quality salient content and being unique compared to the other selected imagesThe investigation is approached by implementing two binary classifications one regard-ing content and one regarding quality The classifications are made using support vectormachines For each of the classifications three feature extraction methods are performedand the results are compared against each other The feature extraction methods used arehistograms of oriented gradients features from the discrete cosine transform domain andfeatures extracted from a pre-trained convolutional neural network The images classifiedas both good and salient are then clustered based on similarity measures retrieved usingcolor coherence vectors One image from each cluster is retrieved and those are the result-ing images from the image selection The performance of the selection is evaluated usingthe measures precision recall and accuracy The investigation showed that using featuresextracted from the discrete cosine transform provided the best results for the quality clas-sification For the content classification features extracted from a convolutional neuralnetwork provided the best results The similarity retrieval showed to be the weakest partand the entire system together provides an average accuracy of 8399

iii

Acknowledgments

First of all I would like to thank my supervisor Marcus Wallenberg at ISY for expertiseand support throughout the thesis work I would also like to thank my examiner LasseAlfredsson at ISY for valuable feedback Also thanks to my supervisor Tina Erlandssonfor the opportunity to do my thesis work at Saab Aeronautics as well as for showing greatinterest in my work

Last but not least I would like to thank my family and friends for love support andcoffee breaks

Linkoumlping 2017Matilda Lorentzon

v

Contents

Notation ix

1 Introduction 111 Motivation 112 Aim 113 Limitations 2

2 Related theory 321 Available data 322 Machine learning 423 Support Vector Machines 524 Histogram of oriented gradients 725 Features extracted from the discrete cosine transform domain 926 Features extracted from a convolutional neural network 13

261 Convolutional neural networks 13262 Extracting features from a pre-trained network 15

27 Color coherence vector 16

3 Method 1731 Feature extraction 1832 Predictor 1933 Similarity retrieval 1934 Evaluation 2035 Generation of training and evaluation data 21

4 Results 2541 Quality classification 2542 Content classification 2843 Similarity retrieval 3044 The entire system 34

5 Discussion 3551 Results 35

vii

viii Contents

511 Quality classification 35512 Content classification 37513 Similarity retrieval part 37514 The entire system 38

52 Method 3953 Possible improvements 39

6 Conclusions 41

Bibliography 43

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in the number of coherent pixels and the number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1 % of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.
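A sketch of how such a color coherence vector could be computed in MATLAB, following the description above, is shown below; the function name and the use of bwconncomp for the contiguous regions are assumptions, not taken from [17].

```matlab
% Sketch of a color coherence vector (assumed helper; 5x5 averaging filter,
% 128 indexed colors, coherence threshold of 2500 pixels as described above).
function [alpha, beta] = colorCoherenceVector(I)
    tau = 2500;                                   % coherence threshold (pixels)
    Ismooth = imfilter(I, fspecial('average', [5 5]));
    idx = rgb2ind(Ismooth, jet(128), 'nodither'); % indexed image, values 0..127
    alpha = zeros(1, 128);                        % coherent pixels per color
    beta  = zeros(1, 128);                        % incoherent pixels per color
    for c = 0:127
        cc = bwconncomp(idx == c);                % contiguous regions of color c
        for k = 1:cc.NumObjects
            n = numel(cc.PixelIdxList{k});
            if n >= tau
                alpha(c + 1) = alpha(c + 1) + n;
            else
                beta(c + 1) = beta(c + 1) + n;
            end
        end
    end
end
```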

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87 % to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, namely the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
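The clustering and retrieval step can be sketched as follows, assuming S is a symmetric matrix of pairwise similarities in percent and scores holds each image's summed classification scores; the greedy assignment order is an assumption.

```matlab
% Sketch of the clustering and per-cluster retrieval described above.
threshold = 87;                      % average-similarity threshold in percent
clusters = {};                       % each cell holds the indices of one cluster
for i = 1:size(S, 1)
    placed = false;
    for c = 1:numel(clusters)
        if mean(S(i, clusters{c})) > threshold
            clusters{c}(end + 1) = i;            % join an existing cluster
            placed = true;
            break;
        end
    end
    if ~placed
        clusters{end + 1} = i;                   % start a new cluster
    end
end

% Retrieve the highest-scoring image from each cluster
retrieved = zeros(1, numel(clusters));
for c = 1:numel(clusters)
    [~, best] = max(scores(clusters{c}));
    retrieved(c) = clusters{c}(best);
end
```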


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means that each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives)    (3.1)

which describes how many of the retrieved images should have been retrieved,

Recall = true positives / (true positives + false negatives)    (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

Accuracy = (true positives + true negatives) / all samples    (3.3)

which describes how many of all classifications made are correct. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.
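Given predicted and true binary labels, the three measures can be computed directly, as in the following sketch (the variable names are assumptions).

```matlab
% Sketch: precision, recall and accuracy from logical label vectors.
tp = sum( predicted &  actual);   % true positives
fp = sum( predicted & ~actual);   % false positives
tn = sum(~predicted & ~actual);   % true negatives
fn = sum(~predicted &  actual);   % false negatives

precision = tp / (tp + fp);                   % equation (3.1)
recall    = tp / (tp + fn);                   % equation (3.2)
accuracy  = (tp + tn) / (tp + fp + tn + fn);  % equation (3.3)
```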


Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected; these are called positives and can be either true or false. The non-selected images are called negatives, which can also be either true or false. The concepts are illustrated in (a) parts of a quantity of images, and how they define the measures is illustrated in (b) precision, (c) recall and (d) accuracy.

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by degrading the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other they are divided into the two groups light and noise/blur. The modification is done randomly and one image can be subject to one alteration alone or a combination of two alterations; to one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
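The degradation and labelling can be sketched in MATLAB as below; the distortion strengths and the probabilities of applying each group are assumptions, since only the fact that they are randomized is stated above.

```matlab
% Sketch of the random degradation and SSIM-based labelling described above.
I = imread('original.jpg');                   % hypothetical input image
J = I;
if rand < 0.5                                 % light group: brighten or darken
    J = J + randi([-60, 60]);                 % uint8 arithmetic saturates
end
if rand < 0.5                                 % noise/blur group
    switch randi(4)
        case 1, J = imnoise(J, 'salt & pepper', 0.02);
        case 2, J = imnoise(J, 'gaussian', 0, 0.01);
        case 3, J = imgaussfilt(J, 2);                        % Gaussian blur
        case 4, J = imfilter(J, fspecial('motion', 15, 45));  % motion blur
    end
end
quality = ssim(rgb2gray(J), rgb2gray(I));     % structural similarity to original
isGood  = quality > 0.65;                     % label according to the threshold
```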


Figure 3.3: An image and examples of degraded versions of it: the original is seen in (a) and the degraded versions are seen in (b) brightened and Gaussian blurred, (c) motion blurred, and (d) darkened with added salt-and-pepper noise. The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of the same images, and the quality evaluation set consists of the content evaluation set and modified versions of the same images. The good class consists of all images in the salient and the non-salient class, plus the modified versions of them that have an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on which object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. This is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set as similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
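A sketch of how such near-duplicates can be generated with rigid transformations is shown below; the particular transformations and their amounts are assumptions.

```matlab
% Sketch: generate near-duplicates of one image without changing saliency.
I = imread('original.jpg');                              % hypothetical input image
versions = {I};
versions{end + 1} = flip(I, 2);                          % mirror
versions{end + 1} = imrotate(I, 5, 'bilinear', 'crop');  % small rotation
versions{end + 1} = imtranslate(I, [15, -10]);           % small translation
% All versions are treated as one cluster and brought to 500 x 500 pixels,
% as for the rest of the data set.
versions = cellfun(@(A) imresize(A, [500 500]), versions, 'UniformOutput', false);
```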

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value equal to the classification accuracy, which in turn is equal to the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.
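The ROC curves can be produced from the classification scores, for example with MATLAB's perfcurve, as in the following sketch (the variable names are assumptions).

```matlab
% Sketch: ROC curve from the positive-class scores returned by predict.
% trueLabels: ground-truth labels, scoresGood: score column for the 'good' class.
[fpr, tpr, ~, auc] = perfcurve(trueLabels, scoresGood, 'good');
plot(fpr, tpr);
xlabel('False positive rate');
ylabel('True positive rate (recall)');
```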

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images   Salient object
0.6951                      cat
0.7288                      airplane
0.6935                      umbrella
0.6821                      handbag
0.6902                      motorbike


Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


Figure 4.1: ROC curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. Guessing by placing all images in one class would lead to an accuracy of 50 % and to one of the values for precision or recall being 100 % and the other 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


Figure 4.2: ROC curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix, showing the true clusters as yellow, in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and an image that is not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities of 32.46 % to (a) and 32.06 % to (b).


Figure 4.4: Color coherence vectors of the images in figure 4.3: (a) for image 4.3a, (b) for image 4.3b and (c) for image 4.3c. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat: (a) is the resulting similarity matrix and (b) is a binary matrix showing the images that originate from the same image as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results, with an average accuracy of 90.54 % compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to provide separating features. For HOG the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b; ideal results are shown in a ROC curve as following the left and the top borders, and the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportions of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in the proportion of good images between the different salient objects, however, causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different contexts in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.
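In practice this amounts to thresholding the positive-class score at a value other than the default, as in the following sketch (the example threshold and variable names are assumptions).

```matlab
% Sketch: moving the decision threshold over the positive-class score
% instead of using the default sign rule.
t = -0.3;                          % more tolerant than the default t = 0
predictedGood = scoresGood > t;    % fewer good images are discarded
```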


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results, with an average accuracy of 82.56 % compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances of 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. The features extracted from a CNN come from a network trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of the values for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved unmodified images are counted as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead the outcome varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer Science, Technology and Applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


Contents

Notation ix

1 Introduction 111 Motivation 112 Aim 113 Limitations 2

2 Related theory 321 Available data 322 Machine learning 423 Support Vector Machines 524 Histogram of oriented gradients 725 Features extracted from the discrete cosine transform domain 926 Features extracted from a convolutional neural network 13

261 Convolutional neural networks 13262 Extracting features from a pre-trained network 15

27 Color coherence vector 16

3 Method 1731 Feature extraction 1832 Predictor 1933 Similarity retrieval 1934 Evaluation 2035 Generation of training and evaluation data 21

4 Results 2541 Quality classification 2542 Content classification 2843 Similarity retrieval 3044 The entire system 34

5 Discussion 3551 Results 35

vii

viii Contents

511 Quality classification 35512 Content classification 37513 Similarity retrieval part 37514 The entire system 38

52 Method 3953 Possible improvements 39

6 Conclusions 41

Bibliography 43

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

\[
K(x, x') = \phi(x)^T \phi(x')
\qquad (24)
\]

which is equal to the inner product of the two vectors x and x' in the feature space. Using kernels, a new non-linear decision function is retrieved:

\[
y(x) = \operatorname{sign}\!\left[\, \sum_{j=1}^{l} y_j K(x, x_j) + w_0 \,\right]
\qquad (25)
\]

which corresponds to the form of the hyperplane in the input space [2] [11]
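To make the classifier step concrete, the following is a minimal MATLAB sketch of training a soft-margin SVM with an RBF kernel and classifying new observations. It assumes the Statistics and Machine Learning Toolbox; the toy feature matrix, kernel choice and parameter values are illustrative assumptions, not the implementation used in this thesis.

rng(0);                                    % reproducibility
X = [randn(50, 2) + 2; randn(50, 2) - 2];  % toy feature matrix, one row per observation
Y = [ones(50, 1); -ones(50, 1)];           % class labels y_j = +1 / -1

model = fitcsvm(X, Y, ...
    'KernelFunction', 'rbf', ...           % non-linear mapping via a kernel
    'KernelScale', 'auto', ...
    'BoxConstraint', 1);                   % soft-margin trade-off

[labels, scores] = predict(model, [2 2; -2 -2]);  % the sign of the score gives the class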

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method for machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50 % overlap, and each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell the gradients in the x and y directions (G_x and G_y) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 25.

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

\[
r = \sqrt{G_x^2 + G_y^2}
\qquad (26)
\]

\[
\theta = \arctan\!\left( \frac{G_y}{G_x} \right)
\qquad (27)
\]

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°-180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20] [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 26.

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°-180°, which makes the rose plots symmetric [12]
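As an illustration of the descriptor, the following minimal MATLAB sketch computes a HOG feature vector for one image and overlays the per-cell histograms as rose plots (Computer Vision System Toolbox). The image, cell size, block size and number of bins are illustrative assumptions.

img = imread('cameraman.tif');                     % any grayscale or RGB image
[hogFeatures, visualization] = extractHOGFeatures(img, ...
    'CellSize', [8 8], ...                         % pixels per cell
    'BlockSize', [2 2], ...                        % cells per block, giving 50 % block overlap
    'NumBins', 9);                                 % unsigned orientation bins over 0-180 degrees

imshow(img); hold on;
plot(visualization);                               % rose-plot overlay as in figure 26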

25 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

\[
B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn}
\cos\!\left( \frac{\pi (2m+1) p}{2M} \right)
\cos\!\left( \frac{\pi (2n+1) q}{2N} \right)
\qquad (28)
\]

where 0 ≤ p ≤ M − 1, 0 ≤ q ≤ N − 1,

\[
\alpha_p =
\begin{cases}
1/\sqrt{M}, & p = 0 \\
\sqrt{2/M}, & 1 \le p \le M-1
\end{cases}
\qquad (29)
\]

and

\[
\alpha_q =
\begin{cases}
1/\sqrt{N}, & q = 0 \\
\sqrt{2/N}, & 1 \le q \le N-1
\end{cases}
\qquad (210)
\]

As seen in equation (28), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13]

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (28). Then a generalized Gaussian density model, shown in equation (211), is introduced and used to approximate the distribution of DCT image coefficients

\[
f(x \mid \alpha, \beta, \gamma) = \alpha \exp\!\left( -\left( \beta \lvert x - \mu \rvert \right)^{\gamma} \right)
\qquad (211)
\]

where x is the multivariate random variable, µ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

\[
\alpha = \frac{\beta \gamma}{2 \Gamma(1/\gamma)}
\qquad (212)
\]

\[
\beta = \frac{1}{\sigma} \sqrt{ \frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)} }
\qquad (213)
\]

where σ is the standard deviation and Γ is the gamma function given by

\[
\Gamma(z) = \int_{0}^{\infty} t^{z-1} \exp(-t) \, dt
\qquad (214)
\]

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 27a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 27b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 27c.

(a) A 5 × 5 block in an image on which the parameters γ and ζ are calculated

(b) A 5 × 5 block split into radial frequency sub-bands a, on which R_a is calculated

(c) A 5 × 5 block split into oriented sub-bands b, on which ζ_b is calculated

Figure 27 Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (211), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 28 illustrates the generalized Gaussian distribution in equation (211) for different values of the parameter γ.

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (211) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ,

\[
\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}}
= \sqrt{ \frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)} - 1 }
\qquad (215)
\]

where X is a random variable representing the histogrammed DCT coefficients, σ_|X| and µ_|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (214) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

\[
E_a = \sigma_a^2
\qquad (216)
\]

The average energy up to band n is described by

\[
E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j
\qquad (217)
\]

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the σ_a^2 from the fit. Using the two parameters E_a and E_{j<a}, a ratio R_a is formed between the difference of the components and the sum of the components according to

\[
R_a = \frac{\lvert E_a - E_{j<a} \rvert}{E_a + E_{j<a}}
\qquad (218)
\]

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios R_2 and R_3 is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature ζ, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζ_b is defined according to equation (215) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζ_b from all the blocks in an image is calculated. ζ_b and the variance of ζ_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the 10th highest percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is summarized by fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
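A minimal MATLAB sketch of the block-wise idea follows: it computes 2-D DCT coefficients of overlapping blocks and estimates the shape parameter γ per block by matching the empirical coefficient-magnitude ratio σ/µ to the theoretical value implied by equation (215) over a grid of candidate γ values. Only the γ-based features are sketched; the block size, grid spacing and pooling follow the description above, but the details are assumptions and not the thesis code (prctile requires the Statistics and Machine Learning Toolbox).

img = im2double(rgb2gray(imread('peppers.png')));   % any natural image
bs  = 5;                                            % block size, 5 x 5 as in figure 27
gam = 0.3:0.05:10;                                  % candidate shape parameters
zetaTheory = sqrt(gamma(1./gam) .* gamma(3./gam) ./ gamma(2./gam).^2 - 1);

[rows, cols] = size(img);
gammaHat = [];
for i = 1:bs-2:rows-bs+1                            % step of 3 gives two-pixel overlap
    for j = 1:bs-2:cols-bs+1
        B = dct2(img(i:i+bs-1, j:j+bs-1));          % 2-D DCT of the block
        m = abs(B(2:end));                          % coefficient magnitudes, DC excluded
        zetaBlock = std(m) / mean(m);               % empirical frequency variation
        [~, k] = min(abs(zetaTheory - zetaBlock));  % closest theoretical ratio
        gammaHat(end+1) = gam(k);                   %#ok<AGROW> estimated gamma per block
    end
end

gammaFeatures = [mean(gammaHat), prctile(gammaHat, 10)];   % pooled as described above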

26 Features extracted from a convolutional neural network

261 Convolutional neural networks

Convolutional neural networks (CNN) are a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 29.

14 2 Related theory

Figure 29 The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer (Image source [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, which is shown in figure 210.

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, and to cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) used here are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
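The following is a minimal sketch of how such descriptors can be obtained with MatConvNet; the model file name, normalization fields and layer choice follow MatConvNet's published conventions for the imported VGG-F model, but the exact paths and version details are assumptions and this is not the thesis code.

run(fullfile('matconvnet', 'matlab', 'vl_setupnn.m'));   % set up MatConvNet (path is an assumption)
net = load('imagenet-vgg-f.mat');                        % pre-trained VGG-F model from the model zoo

im  = imread('peppers.png');
im_ = single(imresize(im, net.meta.normalization.imageSize(1:2)));
im_ = bsxfun(@minus, im_, net.meta.normalization.averageImage);  % subtract the training-set mean

res      = vl_simplenn(net, im_);                        % forward pass through all layers
features = squeeze(res(end).x);                          % 1000 soft-max values used as the descriptor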


27 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

\[
\langle (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) \rangle, \qquad j = 1, 2, \ldots, n
\]
where α_j is the number of coherent pixels, β_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I' is then given by the following parameters:

\[
\text{differentiating pixels} = \sum_{j=1}^{n} \lvert \alpha_j - \alpha'_j \rvert + \lvert \beta_j - \beta'_j \rvert
\qquad (219)
\]

\[
\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2}
\qquad (220)
\]

[17]
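A minimal MATLAB sketch of computing a color coherence vector and the similarity in equations (219)-(220) follows; the number of indexed colors and the coherence threshold here are illustrative assumptions (the values actually used in this work are given in chapter 3), and the code relies on the Image Processing Toolbox.

rgb       = imread('peppers.png');                       % any RGB image
nColors   = 64;                                          % number of indexed colors (assumption)
threshold = round(0.01 * size(rgb,1) * size(rgb,2));     % coherence threshold, ~1 % of the image (assumption)

filtered = imfilter(rgb, fspecial('average', [5 5]));    % local averaging (low-pass) filter
idx      = rgb2ind(filtered, jet(nColors), 'nodither');  % indexed image, values 0..nColors-1

ccv = zeros(nColors, 2);                                 % one [alpha beta] pair per color
for c = 1:nColors
    cc    = bwconncomp(idx == c - 1);                    % contiguous regions of color c
    sizes = cellfun(@numel, cc.PixelIdxList);
    ccv(c, 1) = sum(sizes(sizes >  threshold));          % coherent pixels, alpha_c
    ccv(c, 2) = sum(sizes(sizes <= threshold));          % incoherent pixels, beta_c
end

% Given a second vector ccv2 computed the same way (equations 219-220):
% diffPixels = sum(abs(ccv(:) - ccv2(:)));
% similarity = 1 - diffPixels / (2 * numel(idx));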

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 31. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 31. In the training part, features are extracted from one content training set containing examples of salient and non-salient images, and one quality training set which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 32. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 33. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.

Figure 31 Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

31 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons for why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

32 Predictor

The predictor used is an SVM as described in chapter 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The predictor SVM is chosen because of its advantages, one of them being that it is relatively robust against the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
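A minimal sketch of this predictor step is given below, using MATLAB's fitcsvm/predict; the random feature matrices are placeholders standing in for the extracted HOG, DCT or CNN features, and the variable names are assumptions.

rng(1);
XtrainQuality = [randn(100, 8) + 1; randn(100, 8) - 1];   % placeholder quality features
YtrainQuality = [ones(100, 1); -ones(100, 1)];            % +1 = good, -1 = bad
Xeval         = randn(10, 8);                             % placeholder evaluation features

qualityModel   = fitcsvm(XtrainQuality, YtrainQuality);   % quality classification model
[label, score] = predict(qualityModel, Xeval);            % score(:, 2): certainty of the positive class

% The content model is trained the same way on salient/non-salient examples.
% An image continues to the retrieval step only if both labels are positive,
% and the positive-class scores from the two models are later summed to rank
% images within a similarity cluster.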

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1 % of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87 % to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
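A minimal sketch of this clustering rule follows; it assumes S is a full symmetric matrix of pairwise similarities in [0, 1] (the upper triangular matrix described in chapter 4 can be symmetrized first), and the greedy assignment order is an assumption.

S = [1.00 0.91 0.30;
     0.91 1.00 0.32;
     0.30 0.32 1.00];            % toy pairwise similarity matrix
simThreshold = 0.87;             % average-similarity threshold (87 %)

clusters = {};                   % each cell holds the image indices of one cluster
for i = 1:size(S, 1)
    placed = false;
    for k = 1:numel(clusters)
        if mean(S(i, clusters{k})) > simThreshold
            clusters{k}(end+1) = i;            %#ok<AGROW>
            placed = true;
            break;
        end
    end
    if ~placed
        clusters{end+1} = i;                   %#ok<AGROW> start a new cluster
    end
end
% From each cluster, only the image with the highest summed classification
% score is kept; the remaining near-duplicates are discarded.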


34 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way true/false negatives/positives are achieved.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\[
\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
\qquad (31)
\]

which describes how many of the retrieved images should have been retrieved,

\[
\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\qquad (32)
\]

which describes how many of the images that should be retrieved actually are retrieved, and

\[
\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}}
\qquad (33)
\]

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 32.

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy

Figure 32 An illustration of the concept used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
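The measures are straightforward to compute from the counts of true/false positives/negatives; a minimal MATLAB sketch with placeholder label vectors is shown below.

predicted = logical([1 1 0 1 0 0 1 0]);        % 1 = classified as positive / retrieved
truth     = logical([1 0 0 1 0 1 1 0]);        % 1 = should be positive / retrieved

tp = sum( predicted &  truth);
fp = sum( predicted & ~truth);
fn = sum(~predicted &  truth);
tn = sum(~predicted & ~truth);

precision = tp / (tp + fp);                    % equation (31)
recall    = tp / (tp + fn);                    % equation (32)
accuracy  = (tp + tn) / numel(truth);          % equation (33)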

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 33.

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 33 An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65
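A minimal MATLAB sketch of the degradation and SSIM-based labelling follows (Image Processing Toolbox); the alteration strengths are randomized in the actual data generation, so the fixed values here are illustrative assumptions.

img = rgb2gray(imread('peppers.png'));                    % reference image

bright   = img + 60;                                      % brightening (saturating add)
noisySP  = imnoise(img, 'salt & pepper', 0.05);           % salt and pepper noise
blurred  = imgaussfilt(img, 2);                           % Gaussian blur
motioned = imfilter(img, fspecial('motion', 15, 45));     % motion blur

degraded = imgaussfilt(imnoise(img, 'gaussian', 0, 0.01), 1.5);  % one alteration from each group
quality  = ssim(degraded, img);                           % structural similarity to the original

if quality > 0.65
    label = 'good';
else
    label = 'bad';
end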

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000 but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 34. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 34 Examples of similar images that originate from the same image and belong to the same cluster
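A minimal MATLAB sketch of generating such near-duplicates is shown below; the chosen rotation angle, translation and mirroring are illustrative assumptions.

img = imread('peppers.png');

versions = {img, ...
            imrotate(img, 5, 'bilinear', 'crop'), ...     % small rotation
            imtranslate(img, [15, -10]), ...              % small translation
            flip(img, 2)};                                % horizontal mirroring

for k = 1:numel(versions)
    versions{k} = imresize(versions{k}, [500 500]);       % common 500 x 500 size
end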

4 Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 35 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 41. The results of the quality classification are shown in table 42. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 41. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 41 The proportion of good images for the different salient objects

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike


Table 42 Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average

(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50 % and one of the values for precision or recall to 100 % and the other to 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 43. The results are visualized using ROC-curves, shown in figure 42. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 43 Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average

(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 43 and 44. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 45a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 45b. The results from the retrieval part are shown in table 44.

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of the images in figure 43. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat; (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.

Table 44 Results from the evaluation of the retrieval part for different categories as salient

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

44 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. That is the feature extraction method which provided the best results when investigating the quality classification in section 41. The models used for the content classification are the ones retrieved using features extracted from a CNN. That is the feature extraction method which provided the best results when investigating the content classification in section 42. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 45.

Table 45 Results from the evaluation of the entire system for different categories as salient

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 31-33, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in figure 41c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 41b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 41b and 41c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 41a, the curves are not very similar, which is partly because of the different proportions of good images, but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 41.


512 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 42, as the different colored curves representing different salient objects are differing. Figure 42b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line for this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since it uses very few features, describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since it is able to differ somewhat between salient and non-salient images only described by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks, performing well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 511, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 42.

513 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consists of mostly context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous context, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 45, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

514 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which as shown in equations 31-33 affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of them for the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


52 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen is the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

53 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the results vary between the sets with different salient objects, it is likely that they vary between data sets as well. The data set used differs considerably from the data sets for which the system is intended. A data set containing automatically captured flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context prevails during a recording and where in the image a potential salient object is located.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Motivation
    • 12 Aim
    • 13 Limitations
      • 2 Related theory
        • 21 Available data
        • 22 Machine learning
        • 23 Support Vector Machines
        • 24 Histogram of oriented gradients
        • 25 Features extracted from the discrete cosine transform domain
        • 26 Features extracted from a convolutional neural network
          • 261 Convolutional neural networks
          • 262 Extracting features from a pre-trained network
            • 27 Color coherence vector
              • 3 Method
                • 31 Feature extraction
                • 32 Predictor
                • 33 Similarity retrieval
                • 34 Evaluation
                • 35 Generation of training and evaluation data
                  • 4 Results
                    • 41 Quality classification
                    • 42 Content classification
                    • 43 Similarity retrieval
                    • 44 The entire system
                      • 5 Discussion
                        • 51 Results
                          • 511 Quality classification
                          • 512 Content classification
                          • 513 Similarity retrieval part
                          • 514 The entire system
                            • 52 Method
                            • 53 Possible improvements
                              • 6 Conclusions
                              • Bibliography
Page 4: Feature extraction for image selection using machine learning

Acknowledgments

First of all I would like to thank my supervisor Marcus Wallenberg at ISY for expertiseand support throughout the thesis work I would also like to thank my examiner LasseAlfredsson at ISY for valuable feedback Also thanks to my supervisor Tina Erlandssonfor the opportunity to do my thesis work at Saab Aeronautics as well as for showing greatinterest in my work

Last but not least I would like to thank my family and friends for love support andcoffee breaks

Linkoumlping 2017Matilda Lorentzon

v

Contents

Notation ix

1 Introduction 111 Motivation 112 Aim 113 Limitations 2

2 Related theory 321 Available data 322 Machine learning 423 Support Vector Machines 524 Histogram of oriented gradients 725 Features extracted from the discrete cosine transform domain 926 Features extracted from a convolutional neural network 13

261 Convolutional neural networks 13262 Extracting features from a pre-trained network 15

27 Color coherence vector 16

3 Method 1731 Feature extraction 1832 Predictor 1933 Similarity retrieval 1934 Evaluation 2035 Generation of training and evaluation data 21

4 Results 2541 Quality classification 2542 Content classification 2843 Similarity retrieval 3044 The entire system 34

5 Discussion 3551 Results 35

vii

viii Contents

511 Quality classification 35512 Content classification 37513 Similarity retrieval part 37514 The entire system 38

52 Method 3953 Possible improvements 39

6 Conclusions 41

Bibliography 43

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 4.3a
(b) Color coherence vector of image 4.3b
(c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.
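For reference, a color coherence vector like the ones plotted above can be computed along the following lines (MATLAB assumed). The 5 x 5 averaging filter, the 128 indexed colors and the region threshold follow the method chapter, while the function layout and variable names are illustrative.

    % Sketch (MATLAB assumed): color coherence vector of an RGB image.
    % Pixels in connected regions of a color larger than regionThreshold are
    % counted as coherent (alpha), the remaining pixels as incoherent (beta).
    function [alpha, beta] = colorCoherenceVector(rgbImage, numColors, regionThreshold)
        blurred = imfilter(rgbImage, fspecial('average', [5 5]));  % local averaging
        indexed = rgb2ind(blurred, jet(numColors));                % index to e.g. 128 colors
        alpha = zeros(1, numColors);
        beta  = zeros(1, numColors);
        for c = 1:numColors
            mask = (indexed == c - 1);         % indexed images are zero-based
            cc = bwconncomp(mask, 8);          % connected regions of this color
            for r = 1:cc.NumObjects
                n = numel(cc.PixelIdxList{r});
                if n >= regionThreshold
                    alpha(c) = alpha(c) + n;   % coherent pixels
                else
                    beta(c)  = beta(c) + n;    % incoherent pixels
                end
            end
        end
    end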


(a) Resulting similarity matrix
(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measure between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall   Accuracy  Salient object
0.7782     0.9421   0.7806    cat
0.8071     0.8471   0.7611    airplane
0.7698     0.8843   0.7444    umbrella
0.7537     0.8471   0.7111    handbag
0.7935     0.9050   0.7778    motorbike
0.7805     0.8851   0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, which is the feature extraction method that provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, which is the feature extraction method that provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.
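As an outline, putting the parts together amounts to chaining the two SVM predictions and the retrieval step, for example as below (MATLAB assumed). The feature extraction and retrieval helpers, the label names and the way the two certainty scores are combined are assumptions made for illustration.

    % Sketch (MATLAB assumed) of the full selection chain: quality classification,
    % content classification, then similarity-based retrieval. The helper functions
    % extractDctFeatures, extractCnnFeatures and retrieveUniqueImages are
    % hypothetical placeholders for the methods described in the method chapter.
    function selected = selectImages(images, qualityModel, contentModel)
        % Quality step: keep images predicted as good.
        [qualityLabel, qualityScore] = predict(qualityModel, extractDctFeatures(images));
        good = qualityLabel == "good";
        images = images(good);

        % Content step: keep images predicted as salient.
        [contentLabel, contentScore] = predict(contentModel, extractCnnFeatures(images));
        salient = contentLabel == "salient";
        images = images(salient);

        % Retrieval step: one image per cluster of similar images, ranked by the
        % classification certainty scores (how the scores are combined is an
        % assumption; here the positive-class scores are simply summed).
        score = qualityScore(good, 2);
        score = score(salient) + contentScore(salient, 2);
        selected = retrieveUniqueImages(images, score);
    end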

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall   Accuracy  Salient object
0.5944     0.6813   0.8543    cat
0.6890     0.5117   0.8663    airplane
0.5055     0.6696   0.8168    umbrella
0.4717     0.5117   0.8027    handbag
0.6169     0.6404   0.8592    motorbike
0.5755     0.6029   0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications



in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.
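To make the reasoning about the CNN-based quality classifier explicit: with good as the positive class and the standard definitions assumed for equations 3.1-3.3, classifying every image as good means FN = TN = 0, so that

    recall    = TP / (TP + FN) = TP / TP = 1
    precision = TP / (TP + FP) = (number of good images) / (number of all images)
    accuracy  = (TP + TN) / (TP + FP + TN + FN) = TP / (TP + FP) = precision

which matches the observed behaviour: recall 1, and precision and accuracy equal to the proportion of good images.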

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
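In practice, moving along such a curve corresponds to thresholding the SVM decision scores at a value other than the default. A rough sketch (MATLAB assumed, with a trained model qualityModel, feature matrix X and true labels trueLabels as assumed inputs, and good as the positive class) could look as follows.

    % Sketch (MATLAB assumed): trading the two error types against each other by
    % thresholding the SVM scores instead of using the default predicted labels.
    [~, score] = predict(qualityModel, X);       % second column relates to the positive class
    threshold = -0.5;                            % more tolerant than the default 0 (assumed value)
    predictedGood = score(:, 2) > threshold;

    isGood = (trueLabels == "good");
    truePositiveRate  = sum(predictedGood & isGood)  / sum(isGood);
    falsePositiveRate = sum(predictedGood & ~isGood) / sum(~isGood);

    % The whole curve can also be traced with perfcurve:
    % [fpr, tpr] = perfcurve(trueLabels, score(:, 2), "good");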


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist of mostly context, and the color coherence vectors


are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which as shown in equations 3.1-3.3 affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part, and all of them for the similarity retrieval part, taken separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are counted as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be


found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
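One possible form of the proposed weighting is sketched below (MATLAB assumed): each indexed color gets a weight, and colors known to dominate the expected flight context are down-weighted before the color coherence vectors are compared. The weight value and the list of context colors are assumptions made for illustration.

    % Sketch (MATLAB assumed): similarity measure where colors typical for the
    % expected context (e.g. sky or water) are down-weighted. alpha1/beta1 and
    % alpha2/beta2 are the color coherence vectors of two images and
    % contextColors is a hypothetical list of color indices to suppress.
    numColors = 128;
    w = ones(1, numColors);
    w(contextColors) = 0.2;                    % assumed down-weight for context colors

    diffPixels = sum(w .* (abs(alpha1 - alpha2) + abs(beta1 - beta2)));
    similarity = 1 - diffPixels / (2 * sum(alpha1 + beta1));   % cf. equation 2.20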

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences, if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching. August 11, 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning. May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11 and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


Contents

Notation ix

1 Introduction 111 Motivation 112 Aim 113 Limitations 2

2 Related theory 321 Available data 322 Machine learning 423 Support Vector Machines 524 Histogram of oriented gradients 725 Features extracted from the discrete cosine transform domain 926 Features extracted from a convolutional neural network 13

261 Convolutional neural networks 13262 Extracting features from a pre-trained network 15

27 Color coherence vector 16

3 Method 1731 Feature extraction 1832 Predictor 1933 Similarity retrieval 1934 Evaluation 2035 Generation of training and evaluation data 21

4 Results 2541 Quality classification 2542 Content classification 2843 Similarity retrieval 3044 The entire system 34

5 Discussion 3551 Results 35

vii

viii Contents

511 Quality classification 35512 Content classification 37513 Similarity retrieval part 37514 The entire system 38

52 Method 3953 Possible improvements 39

6 Conclusions 41

Bibliography 43

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis



[Figure 3.1: flowchart of the implementation, with the nodes Training set quality, Training set content, Feature Extraction, Predictor, Quality Classification Model, Content Classification Model, Evaluation set, Similarity retrieval and Images Worthy of Further Analysis; images classified as bad or non-salient leave the chain.]

Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part, and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21], and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task in order to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.
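A sketch of how a feature matrix X for the predictor can be assembled (one image per row) is given below. HOG is shown with MATLAB's extractHOGFeatures; the DCT-domain features of [19] are model-based and are only hinted at here with simple illustrative statistics. Folder and file names are hypothetical, and the images are assumed to already have a common size.

files = dir(fullfile('training_images', '*.jpg'));       % hypothetical folder layout
X = [];
for k = 1:numel(files)
    img  = imread(fullfile('training_images', files(k).name));
    gray = rgb2gray(img);
    hogFeat = extractHOGFeatures(gray);                  % HOG descriptor of the whole image
    coeffs  = dct2(double(gray));                        % 2-D DCT of the image
    dctFeat = [mean(abs(coeffs(:))), std(abs(coeffs(:)))];   % illustrative statistics only
    X(k, :) = hogFeat;                                   % one feature extraction method per model
end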

3.2 Predictor

The predictor used is an SVM, as described in chapter 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The SVM is chosen as predictor because of its advantages, one of them being that it is less prone to over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and it results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].
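A minimal sketch of how such a model can be trained and applied with MATLAB's fitcsvm is shown below. The kernel choice and the variable names are assumptions, as the report does not specify them; X is a feature matrix with one row per image, Y the corresponding labels, and Xeval the evaluation features.

model = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'Standardize', true);
[labels, scores] = predict(model, Xeval);   % class label and certainty score per image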

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1 % of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87 % to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
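The sketch below illustrates this grouping, assuming a symmetric pairwise similarity matrix S with values in [0, 1]; the exact assignment order is an assumption not specified in the report.

threshold = 0.87;
clusters  = {};                                   % each cell holds the indices of one cluster
for i = 1:size(S, 1)
    placed = false;
    for c = 1:numel(clusters)
        if mean(S(clusters{c}, i)) > threshold    % average similarity to that cluster
            clusters{c} = [clusters{c}, i];
            placed = true;
            break
        end
    end
    if ~placed
        clusters{end + 1} = i;                    % the image starts a new cluster
    end
end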


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means that each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way, true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives)        (3.1)

which describes how many of the retrieved images should have been retrieved,

Recall = true positives / (true positives + false negatives)        (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

Accuracy = (true positives + true negatives) / all samples        (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.
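A minimal sketch of computing the three measures from logical vectors of predicted and true class memberships (both assumed variable names) is given below.

tp = sum( predicted &  actual);                 % true positives
fp = sum( predicted & ~actual);                 % false positives
tn = sum(~predicted & ~actual);                 % true negatives
fn = sum(~predicted &  actual);                 % false negatives
precision = tp / (tp + fp);                     % eq. (3.1)
recall    = tp / (tp + fn);                     % eq. (3.2)
accuracy  = (tp + tn) / numel(actual);          % eq. (3.3)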


(a) Parts of a quantity of images  (b) Precision  (c) Recall  (d) Accuracy

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
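The sketch below illustrates one random degradation followed by the SSIM check; the degradation amounts are arbitrary example values and the file name is hypothetical.

img      = imread('example.jpg');                         % hypothetical original image
degraded = img * 0.6;                                     % darkening (light group, example amount)
degraded = imnoise(degraded, 'salt & pepper', 0.02);      % salt and pepper noise (noise/blur group)
% other alterations from the noise/blur group could be, for example:
% degraded = imgaussfilt(img, 2);                         % Gaussian blur
% degraded = imfilter(img, fspecial('motion', 15, 45));   % motion blur
score = ssim(rgb2gray(degraded), rgb2gray(img));          % structural similarity to the original
if score > 0.65
    label = 'good';
else
    label = 'bad';
end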


(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
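A sketch of how such near-duplicates can be generated is shown below; the report does not specify which rigid transformations are used, so the ones here are examples, and the file name is hypothetical.

img      = imread('example.jpg');                        % hypothetical source image
versions = {img, ...
            flip(img, 2), ...                            % mirroring
            imrotate(img, 5, 'bilinear', 'crop'), ...    % small rotation
            imtranslate(img, [15, -10])};                % small translation
for k = 1:numel(versions)
    versions{k} = imresize(versions{k}, [500 500]);      % common 500 x 500 size
end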

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value of 0.65 or less should be classified as bad and the rest as good. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value equal to the classification accuracy, which in turn equals the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images    Salient object
0.6951                    cat
0.7288                    airplane
0.6935                    umbrella
0.6821                    handbag
0.6902                    motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


(a) HOG features  (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation. 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50 %, and one of the values for precision or recall to 100 % and the other to 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


(a) HOG features  (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow, in figure 4.5b. The results from the retrieval part are shown in table 4.4.



Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 4.3a

(b) Color coherence vector of image 4.3b

(c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar images as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. That is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. That is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is obtained when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies by a few percentage points between the different salient objects and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous context, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies by a few percentage points between the different salient objects and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of for example SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image being both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence matrix can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


viii Contents

511 Quality classification 35512 Content classification 37513 Similarity retrieval part 37514 The entire system 38

52 Method 3953 Possible improvements 39

6 Conclusions 41

Bibliography 43

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional, performing computations in that space is computationally heavy. Therefore a kernel function is introduced, which is used to map the original non-linear observations into the higher dimensional space more efficiently. The kernel function can be expressed as a dot product in the high-dimensional space. Through the kernel function, all computations are performed in the low-dimensional input space. The kernel function is

K(x, x') = \phi(x)^T \phi(x') \qquad (2.4)

which is equal to the inner product of the two vectors x and x' in the feature space. Using kernels, a new non-linear decision function is retrieved:

y(x) = \text{sign}\left[ \sum_{j=1}^{l} y_j K(x, x_j) + w_0 \right] \qquad (2.5)

which corresponds to the form of the hyperplane in the input space [2], [11].
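As an illustration of how such a classifier can be used in practice, the following is a minimal MATLAB sketch, not the exact implementation used in this thesis; trainFeatures, trainLabels and evalFeatures are hypothetical feature matrices and label vectors prepared beforehand, and the Statistics and Machine Learning Toolbox is assumed.

% Minimal sketch: train a kernel SVM on labelled feature vectors and predict
% class labels and certainty scores for new observations.
svmModel = fitcsvm(trainFeatures, trainLabels, ...
    'KernelFunction', 'rbf', ...     % non-linear kernel, cf. equation (2.4)
    'Standardize', true);            % rescale each feature before training

% Each row of scores holds the certainty of the negative and positive class.
[predictedLabels, scores] = predict(svmModel, evalFeatures);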

2.4 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method in machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap; each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell the gradients in the x and y directions (G_x and G_y) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 2.5 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r = \sqrt{G_x^2 + G_y^2} \qquad (2.6)

\theta = \arctan\left( \frac{G_y}{G_x} \right) \qquad (2.7)

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°-180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20], [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

(a) Image with rose plots (b) Zoomed in

Figure 2.6 The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°-180°, which makes the rose plots symmetric [12]
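As a rough illustration, the following MATLAB sketch extracts a HOG descriptor and the corresponding rose-plot visualization for one image; it assumes the Computer Vision System Toolbox, and the file name and cell size are only example values, not the settings used in this work.

% Minimal sketch: compute a HOG feature vector and visualize the per-cell
% orientation histograms as rose plots.
I = rgb2gray(imread('example.jpg'));            % hypothetical image file
[hogFeatures, hogVisualization] = extractHOGFeatures(I, ...
    'CellSize', [32 32], 'NumBins', 9);         % 9 unsigned bins, 0-180 degrees

imshow(I); hold on
plot(hogVisualization)                          % rose plots as in figure 2.6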

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn} \cos\left( \frac{\pi (2m+1) p}{2M} \right) \cos\left( \frac{\pi (2n+1) q}{2N} \right) \qquad (2.8)

where 0 \leq p \leq M-1, 0 \leq q \leq N-1,

\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \leq p \leq M-1 \end{cases} \qquad (2.9)

and

\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \leq q \leq N-1 \end{cases} \qquad (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
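A minimal MATLAB sketch of the block-wise transform is given below; it assumes the Image Processing Toolbox, and the block size of 17 × 17 pixels with two pixel overlap is only an example value.

% Minimal sketch: compute local 2D DCT coefficients (equation 2.8) for
% overlapping blocks of a grayscale image using dct2.
I = im2double(rgb2gray(imread('example.jpg')));  % hypothetical image file
N = 17; step = N - 2;                            % two pixel overlap between blocks
blocks = {};
for r = 1:step:size(I,1) - N + 1
    for c = 1:step:size(I,2) - N + 1
        blocks{end+1} = dct2(I(r:r+N-1, c:c+N-1));  %#ok<AGROW>
    end
end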

It has been shown that natural, undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. The features that are extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of the DCT image coefficients.

f(x \mid \alpha, \beta, \gamma) = \alpha \exp\left( -(\beta |x - \mu|)^{\gamma} \right) \qquad (2.11)

where x is the multivariate random variable, \mu is the mean, \gamma is the shape parameter, and \alpha and \beta are the normalizing and scale parameters given by

\alpha = \frac{\beta \gamma}{2 \Gamma(1/\gamma)} \qquad (2.12)

\beta = \frac{1}{\sigma} \sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}} \qquad (2.13)

where \sigma is the standard deviation and \Gamma is the gamma function given by

\Gamma(z) = \int_{0}^{\infty} t^{z-1} \exp(-t) \, dt \qquad (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.


(a) A 5 × 5 block in an image, on which the parameters \gamma and \zeta are calculated

(b) A 5 × 5 block split into radial frequency sub-bands a, on which R_a is calculated

(c) A 5 × 5 block split into oriented sub-bands b, on which \zeta_b is calculated

Figure 2.7 Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter \gamma, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter \gamma determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter \gamma.

Figure 2.8 Generalized Gaussian distribution for different values of \gamma

The parameter \gamma is retrieved by inserting values in the range 0.3-10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block; a sketch of one way to perform such a fit is given after this list. The resulting features are the lowest 10th percentile of \gamma and the mean of \gamma.

• The frequency variation coefficient \zeta,

\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\, \Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)} - 1} \qquad (2.15)

where X is a random variable representing the histogrammed DCT coefficients, \sigma_{|X|} and \mu_{|X|} are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, \Gamma is the gamma function given by equation (2.14) and \gamma is the shape parameter. The feature \zeta is computed for all blocks in the image. The ratio \zeta has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of \zeta and the mean of \zeta.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = \sigma_a^2 \qquad (2.16)

The average energy up to band n is described by

E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j \qquad (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the \sigma_a^2 from the fit. Using the two parameters E_a and E_{j<a}, a ratio R_a between the difference of the components and the sum of the components is formed according to

R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3, and the mean of the two resulting ratios is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature \zeta_b, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. \zeta_b is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting \zeta_b from all the blocks in an image is calculated; \zeta_b and the variance of \zeta_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the 10th highest percentile and the mean of the variance of \zeta across the three orientations from all the blocks in the image.
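The grid search mentioned for the shape parameter \gamma can be sketched as below; this is one possible moment-matching variant, not necessarily the fitting procedure used in [19], and the variable blockDct as well as the exclusion of the DC coefficient are assumptions.

% Sketch: estimate gamma for one block of DCT coefficients by a grid search
% over the range 0.3-10, matching a statistic that depends only on gamma.
x        = blockDct(2:end);                        % coefficients, DC term excluded
rhoHat   = mean(abs(x))^2 / mean(x.^2);            % sample statistic
gammas   = 0.3:0.001:10;
rhoFun   = gamma(2./gammas).^2 ./ (gamma(1./gammas) .* gamma(3./gammas));
[~, k]   = min(abs(rhoFun - rhoHat));
gammaHat = gammas(k);                              % estimated shape parameter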

The features are extracted and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is summarized by fewer values, as if it was a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
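A sketch of the three-scale loop is shown below; extractDctFeatures is a hypothetical helper implementing the eight features above, and the filter standard deviation of 1 is an assumption.

% Minimal sketch: extract the eight DCT-domain features on three scales by
% repeatedly low-pass filtering and sub-sampling the image by a factor 2.
I = im2double(rgb2gray(imread('example.jpg')));
features = [];
for scale = 1:3
    features = [features, extractDctFeatures(I)];  %#ok<AGROW> eight features per scale
    I = imgaussfilt(I, 1);                          % symmetric Gaussian low-pass filter
    I = I(1:2:end, 1:2:end);                        % sub-sample by a factor 2
end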

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 2.9 The structure of an artificial neural network: a simple neural network with three layers, an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value, which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, which is shown in figure 2.10.


Figure 2.10 The structure of a convolutional neural network: a simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs of the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the fully connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, and to cluster novel data sets with high accuracy. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) in this work are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images, used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
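A minimal sketch of how such descriptors can be obtained with MatConvNet is shown below; it assumes that the toolbox is set up and that the pre-trained model file imagenet-vgg-f.mat has been downloaded, and field names and layer indices may differ between MatConvNet versions.

% Minimal sketch: pass one image through the pre-trained VGG-F network and
% keep the 1000 soft-max values as a feature vector.
net = load('imagenet-vgg-f.mat');                        % pre-trained model file
im  = single(imread('example.jpg'));                     % hypothetical image file
im  = imresize(im, net.meta.normalization.imageSize(1:2));
im  = bsxfun(@minus, im, net.meta.normalization.averageImage);  % subtract training mean

res     = vl_simplenn(net, im);                          % forward pass
cnnFeat = squeeze(res(end).x);                           % 1000-dimensional descriptor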


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

< (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) >, \quad j = 1, 2, \ldots, n

where \alpha_j is the number of coherent pixels and \beta_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I' is given by the following parameters:

\text{differentiating pixels} = \sum_{j=1}^{n} \left( |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \right) \qquad (2.19)

\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (2.20)

[17]
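A sketch of how a color coherence vector can be computed in MATLAB is given below; the use of bwconncomp with its default 8-connectivity is an assumption, while the 5 × 5 averaging filter, the 128 colors and the threshold of 2500 pixels follow the implementation described later in section 3.3.

% Minimal sketch: compute the color coherence vector (alpha_j, beta_j) of one
% image with 128 indexed colors and a coherence threshold of 2500 pixels.
I   = imread('example.jpg');                          % hypothetical image file
I   = imfilter(I, fspecial('average', [5 5]));        % local averaging, 5 x 5 pixels
idx = rgb2ind(I, jet(128), 'nodither');               % indexed image, 128 colors
tau = 2500;
alphas = zeros(1, 128); betas = zeros(1, 128);
for j = 1:128
    cc    = bwconncomp(idx == j - 1);                 % contiguous regions of color j
    sizes = cellfun(@numel, cc.PixelIdxList);
    alphas(j) = sum(sizes(sizes >  tau));             % coherent pixels
    betas(j)  = sum(sizes(sizes <= tau));             % incoherent pixels
end
% Comparison to another image (alphas2, betas2), cf. equations (2.19)-(2.20):
% diffPixels = sum(abs(alphas - alphas2) + abs(betas - betas2));
% similarity = 1 - diffPixels / (2 * numel(idx));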

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images of good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.


Figure 3.1 Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part when the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The SVM predictor is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in the number of coherent pixels and the number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
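A sketch of this clustering is given below; S is assumed to be the pairwise similarity matrix with values between 0 and 1, filled for pairs (j, i) with j < i, and the greedy assignment order is an assumption.

% Minimal sketch: place image i in an existing cluster if its average
% similarity to the cluster members exceeds 0.87, otherwise start a new cluster.
threshold = 0.87;
clusters  = {};
for i = 1:size(S, 1)
    placed = false;
    for k = 1:numel(clusters)
        if mean(S(clusters{k}, i)) > threshold        % average similarity to cluster k
            clusters{k}(end+1) = i;  %#ok<AGROW>
            placed = true;
            break
        end
    end
    if ~placed
        clusters{end+1} = i;         %#ok<AGROW> image i starts a new cluster
    end
end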

20 3 Method

3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well this conforms with which images should be retrieved. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (3.1)

which describes how many of the retrieved images actually should be retrieved,

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \qquad (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy

Figure 3.2 An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can also be either true or false. The different concepts are illustrated in (a), and how they define the measures is illustrated in (b), (c) and (d).
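As a small sketch, the three measures can be computed from logical vectors of predicted and true outcomes as follows; the variable names are assumptions.

% Minimal sketch: precision, recall and accuracy (equations 3.1-3.3) from
% logical vectors predicted and actual (true = positive class).
tp = sum( predicted &  actual);   fp = sum( predicted & ~actual);
fn = sum(~predicted &  actual);   tn = sum(~predicted & ~actual);

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
accuracy  = (tp + tn) / (tp + fp + fn + tn);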

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 3.3 An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.
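A sketch of the random degradation and the SSIM check for one image is given below; the parameter ranges and probabilities are illustrative assumptions, not the values used when generating the data.

% Minimal sketch: apply at most one alteration from each group and label the
% result according to its SSIM value against the original.
I = imread('example.jpg');                              % hypothetical image file
J = I;
if rand < 0.5                                           % light group
    if rand < 0.5
        J = J + uint8(randi(60));                       % brighten
    else
        J = J - uint8(randi(60));                       % darken
    end
end
if rand < 0.5                                           % noise/blur group
    switch randi(4)
        case 1, J = imnoise(J, 'salt & pepper', 0.02 * rand);
        case 2, J = imnoise(J, 'gaussian', 0, 0.01 * rand);
        case 3, J = imgaussfilt(J, 1 + 2 * rand);
        case 4, J = imfilter(J, fspecial('motion', randi([5 20]), 360 * rand));
    end
end
isGood = ssim(rgb2gray(J), rgb2gray(I)) > 0.65;         % good class if above 0.65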

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4 Examples of similar images that originate from the same image and belong to the same cluster.
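A sketch of how near-duplicate versions can be generated is shown below; the specific transformations and amounts are examples and assume the Image Processing Toolbox.

% Minimal sketch: create similar versions of one image with rigid
% transformations, then resize to 500 x 500 pixels.
I = imread('example.jpg');                              % hypothetical image file
versions = {I, ...
            flip(I, 2), ...                             % mirror horizontally
            imrotate(I, 5, 'bilinear', 'crop'), ...     % small rotation
            imtranslate(I, [20, -10])};                 % small translation
versions = cellfun(@(v) imresize(v, [500 500]), versions, 'UniformOutput', false);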

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which is equal to the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 4.1 The proportion of good images for the different salient objects

Proportion of good images    Salient object
0.6951                       cat
0.7288                       airplane
0.6935                       umbrella
0.6821                       handbag
0.6902                       motorbike


Table 4.2 Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient

Feature extraction method        Precision  Recall   Accuracy  Salient object
HOG                              0.8399     0.939    0.8332    cat
HOG                              0.8544     0.9799   0.8636    airplane
HOG                              0.8018     0.9702   0.813     umbrella
HOG                              0.8333     0.9442   0.8332    handbag
HOG                              0.8506     0.9236   0.8353    motorbike
HOG                              0.8360     0.9514   0.8357    average
Extracted from the DCT domain    0.9196     0.9116   0.8832    cat
Extracted from the DCT domain    0.9292     0.9500   0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444   0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251   0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425   0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347   0.9054    average
Features extracted from a CNN    0.6951     1        0.6951    cat
Features extracted from a CNN    0.7288     1        0.7288    airplane
Features extracted from a CNN    0.6935     1        0.6935    umbrella
Features extracted from a CNN    0.6821     1        0.6821    handbag
Features extracted from a CNN    0.6902     1        0.6902    motorbike
Features extracted from a CNN    0.6979     1        0.6979    average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.1 ROC curves for the quality classifications. The curves show the relation between the true positive rate (recall) and the false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. Guessing by placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 4.3 Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient

Feature extraction method        Precision  Recall   Accuracy  Salient object
HOG                              0.6631     0.6717   0.6652    cat
HOG                              0.8645     0.8043   0.8391    airplane
HOG                              0.5959     0.5739   0.5924    umbrella
HOG                              0.6759     0.6348   0.6652    handbag
HOG                              0.5758     0.7348   0.5967    motorbike
HOG                              0.6750     0.6839   0.6717    average
Extracted from the DCT domain    0.6253     0.6239   0.6250    cat
Extracted from the DCT domain    0.8182     0.6457   0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196   0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630   0.613     handbag
Extracted from the DCT domain    0.5881     0.7326   0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370   0.6441    average
Features extracted from a CNN    0.9038     0.7761   0.8467    cat
Features extracted from a CNN    1          0.6935   0.8467    airplane
Features extracted from a CNN    0.8155     0.8457   0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804   0.7304    handbag
Features extracted from a CNN    0.9242     0.8217   0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635   0.8256    average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.2 ROC curves for the content classifications. The curves show the relation between the true positive rate (recall) and the false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown, in figure 4.5b, is a binary similarity matrix showing the true clusters as yellow. The results from the retrieval part are shown in table 4.4.


(a) (b)

(c)

Figure 4.3 Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a

(b) Color coherence vector of image 4.3b

(c) Color coherence vector of image 4.3c

Figure 4.4 Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent \alpha, which is the number of coherent pixels for each color. The black bars represent \beta, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5 Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measure between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4 Results from the evaluation of the retrieval part for different categories as salient

Precision  Recall   Accuracy  Salient object
0.7782     0.9421   0.7806    cat
0.8071     0.8471   0.7611    airplane
0.7698     0.8843   0.7444    umbrella
0.7537     0.8471   0.7111    handbag
0.7935     0.9050   0.7778    motorbike
0.7805     0.8851   0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. That is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classification are the ones retrieved using features extracted from a CNN. That is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since only half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5 Results from the evaluation of the entire system for different categories as salient

Precision  Recall   Accuracy  Salient object
0.5944     0.6813   0.8543    cat
0.6890     0.5117   0.8663    airplane
0.5055     0.6696   0.8168    umbrella
0.4717     0.5117   0.8027    handbag
0.6169     0.6404   0.8592    motorbike
0.5755     0.6029   0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is obtained when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b; ideal results are shown in a ROC curve as following the left and the top borders, and the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with a slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different contexts in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many bad images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.
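One way to explore this trade-off in MATLAB is sketched below; it assumes that trueLabels and the score matrix from the classification are available, that 'good' is the label of the positive class, and that perfcurve from the Statistics and Machine Learning Toolbox is used.

% Minimal sketch: sweep the decision threshold over the SVM scores and plot
% the resulting ROC curve instead of using the default threshold.
[fpr, tpr, thresholds] = perfcurve(trueLabels, scores(:, 2), 'good');
plot(fpr, tpr)
xlabel('False positive rate'); ylabel('True positive rate (recall)')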


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high, which implies that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line, in this case where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the classifier is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist of mostly context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, together with the proportion of wanted images being only 0.1859 and the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, and which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are counted as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is intended, many improvements are possible. For example, if it is known what kind of context typically prevails during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight contribute less, giving a similarity measure which is less dependent on the context, as sketched below. The images might also be processed by an automatic target recognition system during flights when collecting data, although such results were not available for this study. Taking advantage of the output of such a system, the positions of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found need to be investigated.
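As a minimal sketch of the context-weighting idea mentioned above, the following Python snippet shows how a per-color weight vector could be applied to the color coherence comparison of equations (2.19)–(2.20). The weight values, the number of indexed colors and the image size are illustrative assumptions and not part of the implementation in this thesis.

```python
import numpy as np

def weighted_ccv_similarity(alpha1, beta1, alpha2, beta2, weights, n_pixels):
    """Pairwise similarity between two images from their color coherence vectors.

    alpha*, beta* : per-color counts of coherent and incoherent pixels.
    weights       : per-color weights in [0, 1]; colors that dominate the expected
                    flight context (e.g. sky, water) get low weights.
    n_pixels      : number of pixels in one image.
    """
    diff = weights * (np.abs(alpha1 - alpha2) + np.abs(beta1 - beta2))
    # Same form as equation (2.20), but with the per-color differences weighted.
    return 1.0 - diff.sum() / (2.0 * n_pixels)

# Example usage with 128 indexed colors and 500 x 500 images (assumed values).
n_colors, n_pixels = 128, 500 * 500
rng = np.random.default_rng(0)
a1, b1 = rng.integers(0, 2000, n_colors), rng.integers(0, 2000, n_colors)
a2, b2 = rng.integers(0, 2000, n_colors), rng.integers(0, 2000, n_colors)
weights = np.ones(n_colors)
weights[:10] = 0.2   # hypothetical: down-weight colors dominated by context
print(weighted_ccv_similarity(a1, b1, a2, b2, weights, n_pixels))
```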

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. This suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
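To illustrate what this kind of feature extraction involves, the sketch below uses a pre-trained network from torchvision as a stand-in for the MatConvNet VGG-F network used in the thesis; the choice of VGG-16 and the preprocessing values are assumptions made only for this example. The final-layer soft-max activations are taken as a 1000-dimensional descriptor, which could then be fed to an SVM.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained ImageNet network used only as a feature extractor (stand-in for VGG-F).
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_descriptor(path):
    """Return a 1000-dimensional soft-max descriptor for one image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        scores = net(x)                      # final-layer activations
    return torch.softmax(scores, dim=1).squeeze(0).numpy()

# The resulting descriptors can be stacked into a feature matrix and used to
# train an SVM classifier, analogously to the content classification here.
```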

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the results vary between sets with different salient objects, they most likely vary between data sets as well. The data set used differs considerably from the data sets for which the system is intended. A data set containing automatically captured flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example which context prevails during a recording and where in the image a potential salient object is located.
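As an illustration of the threshold adjustment mentioned above, the following sketch shows how lowering the decision threshold on the positive-class SVM scores keeps more images, trading precision for recall; the score values and thresholds are hypothetical and would have to be tuned on validation data.

```python
import numpy as np

def classify_with_threshold(scores, threshold=0.0):
    """Keep an image (positive class) when its SVM score exceeds the threshold.

    Lowering the threshold below zero keeps more images, so recall increases
    at the cost of precision, reducing the risk of discarding an important image.
    """
    return np.asarray(scores) > threshold

svm_scores = np.array([1.3, 0.2, -0.1, -0.8])            # hypothetical scores
print(classify_with_threshold(svm_scores, threshold=0.0))   # [ True  True False False]
print(classify_with_threshold(svm_scores, threshold=-0.5))  # [ True  True  True False]
```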


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence – Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11, 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark – part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 7: Feature extraction for image selection using machine learning

Notation

Abbreviations

Abbreviation MeaningDCT Discrete cosine transformSVM Support vector machinesHOG Histogram of oriented gradientsRGB Red green blueSSIM Structural similarityROC Receiver operating characteristic

ix

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color

43 Similarity retrieval 33

(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient

Precision  Recall   Accuracy  Salient object
0.7782     0.9421   0.7806    cat
0.8071     0.8471   0.7611    airplane
0.7698     0.8843   0.7444    umbrella
0.7537     0.8471   0.7111    handbag
0.7935     0.9050   0.7778    motorbike
0.7805     0.8851   0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, since that is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, since that is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.
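A minimal MATLAB sketch of how the three steps are chained in this evaluation is given below. The variable names, the use of predict on SVM models (e.g. trained with fitcsvm) and the assumption that +1 is the positive class with its score in the second score column are made for illustration; clusterBySimilarity refers to the clustering sketch in section 4.3.

% Sketch of the evaluation pipeline (illustrative names and label conventions).
[qLabel, qScore] = predict(qualityModel, dctFeatures);             % quality step
goodIdx = find(qLabel == 1);
[cLabel, cScore] = predict(contentModel, cnnFeatures(goodIdx, :)); % content step
salientIdx = goodIdx(cLabel == 1);
% Combined certainty score used to pick one image per similarity cluster
% (column 2 assumed to hold the positive-class score).
combinedScore = qScore(salientIdx, 2) + cScore(cLabel == 1, 2);
clusters = clusterBySimilarity(ccv(salientIdx, :), nPixels, 0.87);
retrieved = zeros(1, numel(clusters));
for c = 1:numel(clusters)
    [~, best] = max(combinedScore(clusters{c}));
    retrieved(c) = salientIdx(clusters{c}(best));   % images worthy of further analysis
end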

Table 4.5: Results from the evaluation of the entire system for different categories as salient

Precision  Recall   Accuracy  Salient object
0.5944     0.6813   0.8543    cat
0.6890     0.5117   0.8663    airplane
0.5055     0.6696   0.8168    umbrella
0.4717     0.5117   0.8027    handbag
0.6169     0.6404   0.8592    motorbike
0.5755     0.6029   0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being exactly 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for the features to be separating. For HOG, the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.
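The argument above about recall and precision relies on equations 3.1-3.3; assuming these are the standard definitions in terms of true/false positives (TP, FP) and true/false negatives (TN, FN), they read:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

With TN = FN = 0, recall becomes 1 and both precision and accuracy reduce to TP/(TP + FP), i.e. the proportion of good images, which matches the behaviour observed for the CNN features.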

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c, as the different colored curves are similar; the difference in the proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which represent two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
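A minimal MATLAB sketch of this threshold adjustment, assuming the SVM scores for the positive (good) class and the ground-truth labels are available as variables; perfcurve returns one (false positive rate, true positive rate) point per score threshold:

% Sketch: read off the trade-off described by the ROC curves (assumed variables).
[fpr, tpr, thresholds] = perfcurve(trueLabels, goodClassScores, 1);
% For example, pick the strictest threshold that still keeps recall above 0.95,
% i.e. accept more false positives rather than discarding too many good images.
idx = find(tpr >= 0.95, 1, 'first');
chosenThreshold = thresholds(idx);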


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, which describe statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while the salient object is still present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. Both classifications have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are probable causes for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some parts of the content classification and all of the similarity retrieval part when evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are counted as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient from each cluster would probably increase the precision and recall values.
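For reference, the per-cluster correctness criterion described above, and the relaxed criterion suggested in the last sentence, can be sketched in MATLAB as follows (the variable names, and the logical arrays isGood and isSalient, are illustrative assumptions):

% Strict criterion used in the evaluation: the retrieved image must be unmodified.
strictlyCorrect = (ssim(retrievedImage, referenceImage) == 1);
% Relaxed criterion suggested above: any good and salient image from the cluster counts.
relaxedCorrect  = isGood(retrievedIdx) && isSalient(retrievedIdx);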


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
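A minimal sketch of the weighting idea mentioned above, where contextWeight is a hypothetical vector with one weight in [0, 1] per indexed color (low for colors that dominate the expected flight context, such as sky or water):

% Sketch of a context-weighted color coherence difference (hypothetical weights).
diffPixels = sum(contextWeight .* (abs(alpha1 - alpha2) + abs(beta1 - beta2)));
similarity = 1 - diffPixels / (2 * nPixels);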

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Applications Computer science technology and applications Nova Science Publishers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield K Simonyan A Vedaldi and A Zisserman Return of the devil in the details Delving deep into convolutional nets In British Machine Vision Conference 2014 Cited on pages 15 and 18

[4] Dan C Ciresan Ueli Meier Jonathan Masci Luca M Gambardella and Jürgen Schmidhuber Flexible high performance convolutional neural networks for image classification In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Two IJCAI'11 pages 1237–1242 AAAI Press 2011 ISBN 978-1-57735-514-4 doi 105591978-1-57735-516-8IJCAI11-210 URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210 Cited on page 13

[5] RL Delanoy Machine learning apparatus and method for image searching August 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzeng and Trevor Darrell Decaf A deep convolutional activation feature for generic visual recognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation In 2016 International Conference on Information Technology for Organizations Development (IT4OD) pages 1–5 March 2016 doi 101109IT4OD20167479316 Cited on page 9

[9] Yann LeCun Koray Kavukcuoglu and Clément Farabet Convolutional networks and applications in vision In ISCAS pages 253–256 IEEE 2010 ISBN 978-1-4244-5309-2 URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10 Cited on page 15

[10] Tsung-Yi Lin Michael Maire Serge J Belongie Lubomir D Bourdev Ross B Girshick James Hays Pietro Perona Deva Ramanan Piotr Dollár and C Lawrence Zitnick Microsoft COCO common objects in context CoRR abs14050312 2014 URL httparxivorgabs14050312 Cited on page 3

[11] MathWorks Support vector machines for binary classification URL httpssemathworkscomhelpstatssupport-vector-machines-for-binary-classificationhtml Cited on pages 6 7 and 19

[12] MathWorks Extracthogfeatures URL httpssemathworkscomhelpvisionrefextracthogfeatureshtml Cited on page 9

[13] MathWorks Discrete cosine transform URL httpssemathworkscomhelpimagesdiscrete-cosine-transformhtml Cited onpage 10

[14] MathWorks Supervised learning workflow and algorithms URL httpssemathworkscomhelpstatssupervised-learning-machine-learning-workflow-and-algorithmshtmls_tid=conf_addres_DA_eb Cited on page 5

[15] Michael A Nielsen Neural Networks and Deep Learning Determination Press2015 Cited on page 14

[16] Parul Parashar and Er Harish Kundra Comparison of various image classificationmethods International Journal of Advances in Science and Technology (IJAST) 2(1) 2014 Cited on page 19

[17] Greg Pass Ramin Zabih and Justin Miller Comparing images using color coherence vectors In Proceedings of the Fourth ACM International Conference on Multimedia MULTIMEDIA '96 pages 65–73 New York NY USA 1996 ACM ISBN 0-89791-871-1 doi 101145244130244148 URL httpdoiacmorg101145244130244148 Cited on pages 16 and 19

[18] Srini Penchikala Big data processing with apache spark - part 4 Spark machine learning May 2016 URL httpswwwinfoqcomarticlesapache-spark-machine-learning Cited on page 4

[19] MA Saad AC Bovik and C Charrier Blind image quality assessment A natural scene statistics approach in the DCT domain IEEE Transactions on image processing 21(8) August 2008 Cited on pages 10 11 and 19


[20] F Suard A Rakotomamonjy and A Bensrhair Pedestrian detection using infrared images and histograms of oriented gradients In IEEE Conference on Intelligent Vehicles pages 206–212 2006 Cited on pages 9 18 and 19

[21] Zhou Wang A C Bovik H R Sheikh and E P Simoncelli Image quality assessment From error visibility to structural similarity Trans Img Proc 13(4):600–612 April 2004 ISSN 1057-7149 doi 101109TIP2003819861 URL httpdxdoiorg101109TIP2003819861 Cited on pages 18 and 22

Page 8: Feature extraction for image selection using machine learning

1Introduction

11 Motivation

The collection of image data is increasing rapidly for many organisations within the fieldsof for example military law enforcement and medical science As sensors and massstorage devices become more capable and less expensive the data collection increases andthe databases being accumulated grow larger eventually making it impossible for analyststo screen all of the data collected in a reasonable time This is why computer assistancebecomes increasingly important and when searching by meta-data is impractical the onlysolution is to search by image content [5]

During flights with manned or unmanned aircraft continuous recording can result ina very high number of images to analyze and evaluate The images are assumed to be eval-uated by automatic target recognition functions as well as image analysts on the groundand also by pilots during missions The images may contain interesting objects like ve-hicles buildings or people but most contain nothing of interest for the reconnaissancemission A single target can often be found in multiple images which are similar to eachother The images can also be of different interpretation quality meaning that propertieslike different lightning conditions and blur affect the userrsquos ability to interpret the imagecontent To simplify image analysis and to minimize data link usage appropriate imagesare suggested for transfer and analysis

12 Aim

The aim of the masterrsquos thesis is to investigate which features in images that can be usedto select images worthy of further analysis This is done by implementing two classifica-tions one regarding quality and one regarding content In the first classification imageswill be binarily classified as either good or bad depending on the image quality In thisreport good and bad refers to the two quality classes The images classified as good will

1

2 1 Introduction

continue to the next classification where they will be binarily classified as either salient ornon-salient depending on the image content In this report salient and non-salient refersto the two content classes The images classified as salient will continue to the next stepwhere the final retrieval will be done depending on similarity measures In the case wherethere is a set of images that are almost identical the image with the highest certainty ofbeing good and salient will be retrieved What is interesting content in an image dependson the use case and data set

The masterrsquos thesis will answer the following questions

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between good and bad quality images

bull Can any of the provided feature extraction methods produce features useful fordifferentiating between salient and non-salient content in images

bull Is it possible to make a good image selection using machine learning classificationsbased on both image content and quality followed by a retrieval based on similaritymeasures

13 Limitations

The investigation is limited to an example data set which is modified to fit the task Badquality images are limited to the distortion types described in section 35 which are addedto the images Similar images are retrieved synthetically from one image The investiga-tion is limited to only using one classification model for all classifications The classifica-tions and retrievals are done using one salient class at a time

2Related theory

This chapter covers the related theory which supports the methods used in this thesisUnless anything else is specified the content of a paragraph is supported in the referencesspecified at the end of the paragraph without case specific modifications

21 Available data

The data used is the COCO - Common Objects in Context [10] data set which contains91 different object categories such as food animals and vehicles It contains many non-iconic images of the objects in their natural environment as oppose to iconic images whichtypically have a large object in a canonical perspective centered in the image Non-iconicimages contain more contextual information and the object in non-canonical perspectivesFigure 21 shows examples of iconic and non-iconic images from the COCO data set

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic

3

4 2 Related theory

22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make pre-dictions about new data Itrsquos based on creating models from observations called trainingdata for data-driven decision making The concept is illustrated by a flow chart in figure22 where the vertical part of the flow is called the training part and the horizontal part iscalled the evaluation part [18]

New Data Model Prediction

MachineLearning

Algorithm

TrainingData

Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models this report focuses the onecalled supervised learning In supervised learning the input training data have correspond-ing outputs and the goal is to find a function or model that correctly maps the inputs tothe outputs That is in contrast to unsupervised learning for which the input data has nocorresponding output The goal of unsupervised learning is to model the underlying struc-ture or distribution of the input data to create corresponding outputs [18] A common useof supervised machine learning is classification where the observations are labelled withclasses and the prediction outputs are different classes It can be described in a simplemanner as finding the function f that fulfills Y = f (X) where X contains the input ob-servations and and Y the corresponding output classes With X and Y as matrices thedescription becomes as follows

23 Support Vector Machines 5

class(observation1)class(observation2)

= fobservation1

observation2

(21)

Y is a column vector where each row contains the class of the corresponding rows inX Each row in X corresponds to an observation which is represented by the values alsocalled features in its columns These values can be measurements such ash weight andheight but when it comes to images the compilation of the values in X becomes morecomplex [14] Raw pixel values can be used as features for images but for other thansimple cases the representation is not descriptive enough specially when working withnatural images The aim is to represent an image by distinctive attributes that diversethe observations from one class from the other Therefore an important step when usingmachine learning on images is feature extraction [7] In figure 22 the feature extraction isa big part of the first step in both the training part and the evaluation part There are manymethods for feature extraction this thesis covers three of them histogram of orientedgradients in section 24 features extracted from the discrete cosine domain in section 25and features extracted from a pre-trained convolutional neural network in section 26

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model Bylearning from provided examples -the training data- the model finds a function that cou-ples input data to the correct output The output for novel data can then be predicted byapplying the retrieved function SVM is often used for classification problems for whichthe correct output is the class the data belongs to The model works by creating a hyper-plane that separates data points from one class from those from the other class with amargin as high as possible The margin is the maximal width of the slab parallel to thehyperplane that has no interior data points The support vectors which give the modelits name are the data points closest to the hyperplane and therefore determine the marginThe margin and the support vectors are illustrated in 23

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].
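As an illustration, a quality model and a content model could be trained and applied as sketched below. This assumes the MATLAB functions fitcsvm and predict, and an illustrative label convention of +1 for good/salient and -1 for bad/non-salient; the variable names are not from the thesis.

% Sketch: one binary SVM per classification task.
qualityModel = fitcsvm(Xquality, yQuality);   % good (+1) vs bad (-1)
contentModel = fitcsvm(Xcontent, yContent);   % salient (+1) vs non-salient (-1)

% predict returns a label and a score matrix with one column per class
% (negative class in the first column, positive class in the second).
[qLabel, qScore] = predict(qualityModel, Xnew);
[cLabel, cScore] = predict(contentModel, Xnew);

keep          = (qLabel == 1) & (cLabel == 1);  % both good and salient
combinedScore = qScore(:, 2) + cScore(:, 2);    % used later in the retrieval step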

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1.0% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.
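A minimal sketch of this preprocessing is given below, assuming Image Processing Toolbox functions; ccvDifference is a hypothetical helper that computes the color coherence vectors of the two indexed images with the given coherence threshold and sums the differences over all colors, in the spirit of [17].

% Sketch: preprocessing before comparing two images with color coherence vectors.
fileA = 'imageA.jpg';  fileB = 'imageB.jpg';       % illustrative file names
lp    = fspecial('average', [5 5]);                % 5 x 5 local averaging filter
imA   = imfilter(imread(fileA), lp, 'replicate');
imB   = imfilter(imread(fileB), lp, 'replicate');

% Reduce each image to 128 indexed colors using the jet colormap.
idxA = rgb2ind(im2double(imA), jet(128));
idxB = rgb2ind(im2double(imB), jet(128));

% Coherence threshold of 2500 pixels (1% of a 500 x 500 image).
tau    = 2500;
diffAB = ccvDifference(idxA, idxB, tau);           % hypothetical helper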

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
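The clustering and the final selection could be sketched as below, under stated assumptions: S is an upper triangular matrix of pairwise similarities in percent, combinedScore holds the summed classification scores from the previous step, and the greedy cluster-assignment order is an implementation choice not specified in the thesis.

% Sketch: cluster images by average similarity, then keep one image per cluster.
threshold = 87;                        % average similarity in percent
clusters  = {};                        % each cell holds indices of similar images
for i = 1:size(S, 1)
    placed = false;
    for c = 1:numel(clusters)
        m       = clusters{c};
        pairSim = arrayfun(@(j) S(min(i, j), max(i, j)), m);  % S is upper triangular
        if mean(pairSim) > threshold
            clusters{c} = [m, i];
            placed = true;
            break;
        end
    end
    if ~placed
        clusters{end+1} = i;           % start a new cluster
    end
end

% From each cluster, retrieve the image with the highest combined score.
retrieved = cellfun(@(m) m(find(combinedScore(m) == max(combinedScore(m)), 1)), clusters);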


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false positive/negative.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}    (3.1)

which describes how many of the retrieved images should have been retrieved,

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}    (3.2)

which describes how many of the images that should be retrieved are actually retrieved, and

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}}    (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.
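The measures follow directly from counting the four outcome types; a small sketch, assuming predicted and true labels coded as +1/-1 in vectors named here for illustration only:

% Sketch: computing precision, recall and accuracy from binary labels.
tp = sum(predicted ==  1 & truth ==  1);
fp = sum(predicted ==  1 & truth == -1);
tn = sum(predicted == -1 & truth == -1);
fn = sum(predicted == -1 & truth ==  1);

precision = tp / (tp + fp);                  % equation (3.1)
recall    = tp / (tp + fn);                  % equation (3.2)
accuracy  = (tp + tn) / (tp + fp + tn + fn); % equation (3.3)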


(a) Parts of a quantity of images (b) Precision (c) Recall (d) Accuracy

Figure 3.2: An illustration of the concept used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories. To fit the task, new categories are formed. One category is set to form the salient class; the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
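A sketch of how one training image could be degraded and labelled is given below; the probability of applying each group, the parameter ranges and the use of grayscale SSIM are illustrative assumptions, not the exact settings used in the thesis.

% Sketch: random degradation of one image and SSIM-based labelling.
ref = imread('original.jpg');
im  = ref;

if rand < 0.5                                   % group 1: lighting
    if rand < 0.5
        im = im + uint8(80 * rand);             % brighten
    else
        im = im - uint8(80 * rand);             % darken
    end
end

if rand < 0.5                                   % group 2: noise / blur
    switch randi(4)
        case 1, im = imnoise(im, 'salt & pepper', 0.05 * rand);
        case 2, im = imnoise(im, 'gaussian', 0, 0.01 * rand);
        case 3, im = imgaussfilt(im, 1 + 2 * rand);
        case 4, im = imfilter(im, fspecial('motion', 10 + 10 * rand, 360 * rand));
    end
end

% Label by structural similarity to the original [21].
if ssim(rgb2gray(im), rgb2gray(ref)) > 0.65
    label = 'good';
else
    label = 'bad';
end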


(a) Original image (b) Brightened and Gaussian blurred (c) Motion blurred (d) Darkened and added salt-and-pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it, as sketched below. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.
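A minimal sketch of how near-duplicate versions of one image could be generated with rigid transformations; the particular transformations and parameter values are illustrative assumptions.

% Sketch: creating similar images (one cluster) from a single original.
im       = imread('original.jpg');
versions = {im, ...
            imrotate(im, 5, 'bilinear', 'crop'), ...   % small rotation
            imtranslate(im, [15, -10]), ...            % small translation
            flip(im, 2)};                              % horizontal reflection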

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which in turn is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images    Salient object
0.6951                       cat
0.7288                       airplane
0.6935                       umbrella
0.6821                       handbag
0.6902                       motorbike


Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow, in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a (b) Color coherence vector of image 4.3b (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix (b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to be able to provide separating features. For HOG the recall is overall very high, and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line, for this case where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error might be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist of mostly context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, together with the proportion of wanted images being only 0.1859 and the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of those for the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1, and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen is the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence matrix can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences, if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Ap-plications Computer science technology and applications Nova Science Publish-ers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield K Simonyan A Vedaldi and A Zisserman Return of the devil in thedetails Delving deep into convolutional nets In British Machine Vision Conference2014 Cited on pages 15 and 18

[4] Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. Flexible high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi 105591978-1-57735-516-8IJCAI11-210. URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210. Cited on page 13.

[5] RL Delanoy Machine learning apparatus and method for image searching Au-gust 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzengand Trevor Darrell Decaf A deep convolutional activation feature for generic visualrecognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognitionbased histogram of oriented gradient and normalized cross correlation In 2016 In-ternational Conference on Information Technology for Organizations Development


(IT4OD) pages 1ndash5 March 2016 doi 101109IT4OD20167479316 Cited onpage 9

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J Belongie, Lubomir D Bourdev, Ross B Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL httparxivorgabs14050312. Cited on page 3.

[11] MathWorks Support vector machines for binary classifica-tion URL httpssemathworkscomhelpstatssupport-vector-machines-for-binary-classificationhtmlCited on pages 6 7 and 19

[12] MathWorks Extracthogfeatures URL httpssemathworkscomhelpvisionrefextracthogfeatureshtml Cited on page 9

[13] MathWorks Discrete cosine transform URL httpssemathworkscomhelpimagesdiscrete-cosine-transformhtml Cited onpage 10

[14] MathWorks Supervised learning workflow and algorithms URL httpssemathworkscomhelpstatssupervised-learning-machine-learning-workflow-and-algorithmshtmls_tid=conf_addres_DA_eb Cited on page 5

[15] Michael A Nielsen Neural Networks and Deep Learning Determination Press2015 Cited on page 14

[16] Parul Parashar and Er Harish Kundra Comparison of various image classificationmethods International Journal of Advances in Science and Technology (IJAST) 2(1) 2014 Cited on page 19

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi 101145244130244148. URL httpdoiacmorg101145244130244148. Cited on pages 16 and 19.

[18] Srini Penchikala Big data processing with apache spark - part 4 Spark ma-chine learning May 2016 URL httpswwwinfoqcomarticlesapache-spark-machine-learning Cited on page 4

[19] MA Saad AC Bovik and C Charrier Blind image quality assessment A naturalscene statistics approach in the dct domain IEEE Transactions on image processing21(8) August 2008 Cited on pages 10 11 and 19


[20] F Suard A Rakotomamonjy and A Bensrhair Pedestrian detection using infraredimages and histograms of oriented gradients In in IEEE Conference on IntelligentVehicles pages 206ndash212 2006 Cited on pages 9 18 and 19

[21] Zhou Wang A C Bovik H R Sheikh and E P Simoncelli Image quality as-sessment From error visibility to structural similarity Trans Img Proc 13(4)600ndash612 April 2004 ISSN 1057-7149 doi 101109TIP2003819861 URLhttpdxdoiorg101109TIP2003819861 Cited on pages 18and 22



continue to the next classification, where they will be binarily classified as either salient or non-salient depending on the image content. In this report, salient and non-salient refer to the two content classes. The images classified as salient will continue to the next step, where the final retrieval will be done depending on similarity measures. In the case where there is a set of images that are almost identical, the image with the highest certainty of being good and salient will be retrieved. What is interesting content in an image depends on the use case and data set.

The master's thesis will answer the following questions:

• Can any of the provided feature extraction methods produce features useful for differentiating between good and bad quality images?

• Can any of the provided feature extraction methods produce features useful for differentiating between salient and non-salient content in images?

• Is it possible to make a good image selection using machine learning classifications based on both image content and quality, followed by a retrieval based on similarity measures?

1.3 Limitations

The investigation is limited to an example data set which is modified to fit the task. Bad quality images are limited to the distortion types described in section 3.5, which are added to the images. Similar images are retrieved synthetically from one image. The investigation is limited to only using one classification model for all classifications. The classifications and retrievals are done using one salient class at a time.

2 Related theory

This chapter covers the related theory which supports the methods used in this thesis. Unless anything else is specified, the content of a paragraph is supported by the references specified at the end of the paragraph, without case-specific modifications.

2.1 Available data

The data used is the COCO - Common Objects in Context [10] data set, which contains 91 different object categories, such as food, animals and vehicles. It contains many non-iconic images of the objects in their natural environment, as opposed to iconic images, which typically have a large object in a canonical perspective centered in the image. Non-iconic images contain more contextual information and show the object in non-canonical perspectives. Figure 2.1 shows examples of iconic and non-iconic images from the COCO data set.

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 2.1: Examples of images from the data set containing the object cat. (a) is an iconic image, while (b) and (c) are non-iconic.


2.2 Machine learning

Machine learning is the concept of learning from large sets of existing data to make predictions about new data. It is based on creating models from observations, called training data, for data-driven decision making. The concept is illustrated by a flow chart in figure 2.2, where the vertical part of the flow is called the training part and the horizontal part is called the evaluation part [18].


Figure 2.2: The concept of machine learning, where a machine learning algorithm creates a decision model from training data. The model is then used to make predictions about new data. (Flow chart drawn according to [18].)

There are different types of machine learning models; this report focuses on the one called supervised learning. In supervised learning the input training data have corresponding outputs, and the goal is to find a function or model that correctly maps the inputs to the outputs. That is in contrast to unsupervised learning, for which the input data has no corresponding output. The goal of unsupervised learning is to model the underlying structure or distribution of the input data to create corresponding outputs [18]. A common use of supervised machine learning is classification, where the observations are labelled with classes and the prediction outputs are different classes. It can be described in a simple manner as finding the function f that fulfills Y = f(X), where X contains the input observations and Y the corresponding output classes. With X and Y as matrices, the description becomes as follows:


\begin{bmatrix} \text{class}(\text{observation}_1) \\ \text{class}(\text{observation}_2) \\ \vdots \end{bmatrix} = f\left(\begin{bmatrix} \text{observation}_1 \\ \text{observation}_2 \\ \vdots \end{bmatrix}\right)    (2.1)

Y is a column vector where each row contains the class of the corresponding row in X. Each row in X corresponds to an observation, which is represented by the values, also called features, in its columns. These values can be measurements such as weight and height, but when it comes to images the compilation of the values in X becomes more complex [14]. Raw pixel values can be used as features for images, but for other than simple cases the representation is not descriptive enough, especially when working with natural images. The aim is to represent an image by distinctive attributes that distinguish the observations of one class from the other. Therefore an important step when using machine learning on images is feature extraction [7]. In figure 2.2 the feature extraction is a big part of the first step in both the training part and the evaluation part. There are many methods for feature extraction; this thesis covers three of them: histogram of oriented gradients in section 2.4, features extracted from the discrete cosine transform domain in section 2.5, and features extracted from a pre-trained convolutional neural network in section 2.6.

2.3 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model. By learning from provided examples - the training data - the model finds a function that couples input data to the correct output. The output for novel data can then be predicted by applying the retrieved function. SVM is often used for classification problems, for which the correct output is the class the data belongs to. The model works by creating a hyperplane that separates data points from one class from those from the other class, with a margin as large as possible. The margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors, which give the model its name, are the data points closest to the hyperplane and therefore determine the margin. The margin and the support vectors are illustrated in figure 2.3.


Figure 2.3: Illustration of the hyperplane separating data points from two classes, shown as + and -. The support vectors and the margin are marked. Figure drawn according to [11].

The data might not allow for a separating hyperplane; in that case a soft margin can be used, which means that the hyperplane separates many but not all data points. The data for training is a set of vectors x_j along with their classes y_j, where j is a training instance, j = 1, 2, ..., l, and l is the number of training instances. The hyperplane can be created in a higher dimensional space if separating the classes requires it. The hyperplane is described by w^T \phi(x_j) + w_0 = 0, where \phi is a function that maps x_j to a higher-dimensional space and w is the normal to the hyperplane. The SVM classifier satisfies the following conditions:

w^T \phi(x_j) + w_0 \ge +1 \quad \text{if } y_j = +1
w^T \phi(x_j) + w_0 \le -1 \quad \text{if } y_j = -1, \qquad j = 1, 2, \ldots, l    (2.2)

and classifies according to the following decision function:

y(x) = \text{sign}\left[ w^T \phi(x) + w_0 \right]    (2.3)

where \phi non-linearly maps x to the high-dimensional feature space. A linear separation is then performed in the feature space, which is illustrated in figure 2.4.


Figure 2.4: Illustration of the non-linear mapping of \phi from the input space to the high-dimension feature space. The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space, but the resulting feature space can be of higher dimensions. In both spaces the data points of different classes, shown as + and -, are on different sides of the hyperplane, but in the high-dimensional space they are linearly separable. Figure drawn according to [2].

If the feature space is high-dimensional, performing computations in that space is computationally heavy. Therefore a kernel function is introduced, which is used to map the original non-linear observations into higher dimensional space more efficiently. The kernel function can be expressed as a dot product in a high-dimensional space. Through the kernel function, all computations are performed in the low-dimensional input space. The kernel function is

K(x, x') = \phi(x)^T \phi(x')    (2.4)

which is equal to the inner product of the two vectors x and x' in the feature space. Using kernels, a new non-linear decision function is retrieved:

y(x) = \text{sign}\left[ \sum_{j=1}^{l} y_j K(x, x_j) + w_0 \right]    (2.5)

which corresponds to the form of the hyperplane in the input space [2], [11].
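As an illustration of how such a kernel SVM could be trained in practice, the sketch below uses the MATLAB function fitcsvm with an RBF kernel; the feature matrix X, the label vector y and the parameter choices are illustrative assumptions rather than the configuration used later in this thesis.

% Sketch: soft-margin SVM with a kernel function (illustrative settings).
% X: l-by-d matrix of training feature vectors, y: l-by-1 labels (+1 / -1).
svmModel = fitcsvm(X, y, ...
    'KernelFunction', 'rbf', ...   % kernel K(x, x') replacing the dot product
    'KernelScale',    'auto', ...
    'BoxConstraint',  1);          % soft-margin trade-off

% The decision function in equation (2.5) is evaluated by predict.
[labels, scores] = predict(svmModel, Xnew);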

2.4 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method for machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap, and each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell the gradients in the x and y directions (G_x and G_y) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

(a) Original image (b) Gradient in the x direction, G_x (c) Gradient in the y direction, G_y

Figure 2.5: An image and its gradient representations in the x and y directions.

The magnitude and phase of the gradients are then calculated according to

r = \sqrt{G_x^2 + G_y^2}    (2.6)

\theta = \arctan\left(\frac{G_y}{G_x}\right)    (2.7)

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°-180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20], [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

(a) Image with rose plots (b) Zoomed in

Figure 2.6: The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°-180°, which makes the rose plots symmetric [12].

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn} \cos\left(\frac{\pi(2m+1)p}{2M}\right) \cos\left(\frac{\pi(2n+1)q}{2N}\right)    (2.8)

where 0 \le p \le M-1, 0 \le q \le N-1,

\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases}    (2.9)

and

\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases}    (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
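A minimal sketch of the transform on a single block, assuming the Image Processing Toolbox function dct2; the 5 × 5 block size matches figure 2.7 and the block position is arbitrary.

% Sketch: 2-D DCT of one image block, corresponding to equation (2.8).
im    = im2double(rgb2gray(imread('example.jpg')));
block = im(1:5, 1:5);   % one N-by-N block (overlap handling omitted)
B     = dct2(block);    % DCT coefficients B_pq of the block
% Most of the visually significant information is concentrated in the
% low-frequency coefficients in the upper-left corner of B.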

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure, and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of DCT image coefficients:

f(x | α, β, γ) = α exp(−(β|x − µ|)^γ)    (2.11)

where x is the multivariate random variable, µ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

α = βγ / (2Γ(1/γ))    (2.12)

β = (1/σ) √( Γ(3/γ) / Γ(1/γ) )    (2.13)

where σ is the standard deviation and Γ is the gamma function given by

Γ(z) = ∫₀^∞ t^(z−1) exp(−t) dt    (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.


Figure 2.7: Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks: (a) a 5 × 5 block in an image, on which the parameters γ and ζ are calculated; (b) a 5 × 5 block split into radial frequency sub-bands a, on which R_a is calculated; (c) a 5 × 5 block split into oriented sub-bands b, on which ζ_b is calculated. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8: Generalized Gaussian distribution for different values of γ.

The parameter γ is retrieved by inserting values in the range 0.3–10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ:

ζ = σ_|X| / µ_|X| = √( Γ(1/γ)Γ(3/γ) / Γ²(2/γ) − 1 )    (2.15)

where X is a random variable representing the histogrammed DCT coefficients, σ_|X| and µ_|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = σ_a²    (2.16)

The average energy up to band a is described by

E_{j&lt;a} = ( 1 / (a − 1) ) Σ_{j&lt;a} E_j    (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking σ_a² from the fit. Using the two parameters E_a and E_{j&lt;a}, a ratio R_a between the difference of the components and the sum of the components is computed according to

R_a = |E_a − E_{j&lt;a}| / (E_a + E_{j&lt;a})    (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature ζ_b, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζ_b is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζ_b from all the blocks in an image is calculated. ζ_b and the variance of ζ_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the highest 10th percentile and the mean of the variance of ζ_b across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is summarized by fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
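As an illustration of the model fitting described above, the sketch below estimates the generalized Gaussian shape parameter γ for a set of DCT coefficients by searching the range 0.3–10, and computes ζ from equation (2.15). The moment-matching criterion used for the search is an assumption, since the exact matching criterion is not specified here.

```python
# Sketch: fit the generalized Gaussian shape parameter gamma to DCT coefficients.
import numpy as np
from scipy.special import gamma as gamma_fn

def fit_ggd_shape(coeffs, grid=np.arange(0.3, 10.0, 0.001)):
    coeffs = np.asarray(coeffs, dtype=np.float64)
    x = coeffs - coeffs.mean()
    # empirical ratio E[|x|]^2 / E[x^2]
    rho_hat = (np.abs(x).mean() ** 2) / (x ** 2).mean()
    # theoretical ratio for a generalized Gaussian with shape parameter gamma
    rho = gamma_fn(2.0 / grid) ** 2 / (gamma_fn(1.0 / grid) * gamma_fn(3.0 / grid))
    return grid[np.argmin(np.abs(rho - rho_hat))]

def frequency_variation(gamma_shape):
    # zeta in equation (2.15), computed from the fitted shape parameter
    g1 = gamma_fn(1.0 / gamma_shape)
    g2 = gamma_fn(2.0 / gamma_shape)
    g3 = gamma_fn(3.0 / gamma_shape)
    return np.sqrt(g1 * g3 / g2 ** 2 - 1.0)
```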

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

A convolutional neural network (CNN) is a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value, which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, in simplified terms, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which are applied to images and have a special layer structure, which is shown in figure 2.10.


Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the fully connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, and to cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. These features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
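A sketch of this kind of descriptor extraction is shown below. The thesis uses MatConvNet's VGG-F model; since that model is not bundled with common Python frameworks, the sketch uses torchvision's ImageNet-trained VGG-16 as a stand-in and takes the 1000-dimensional class-score output as the descriptor. The model choice and preprocessing values are assumptions, not the exact setup used in the thesis.

```python
# Sketch: fixed-length image descriptor from a pre-trained CNN (VGG-16 as a stand-in).
import torch
from torchvision import models, transforms
from PIL import Image

# Newer torchvision versions use the `weights=` argument instead of `pretrained=True`.
model = models.vgg16(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def cnn_descriptor(path):
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        scores = model(preprocess(image).unsqueeze(0))   # shape (1, 1000)
    return scores.squeeze(0).numpy()
```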


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

⟨(α_1, β_1), ..., (α_n, β_n)⟩, j = 1, 2, ..., n

where α_j is the number of coherent pixels and β_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I′ is then given by the following parameters:

differentiating pixels = Σ_{j=1}^{n} ( |α_j − α′_j| + |β_j − β′_j| )    (2.19)

similarity = 1 − differentiating pixels / (all pixels × 2)    (2.20)

[17]
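A minimal sketch of computing a color coherence vector and the similarity in equations (2.19)–(2.20) could look as follows; the 8-connectivity used when finding contiguous regions is an assumption.

```python
# Sketch: color coherence vector and CCV-based similarity.
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors, tau):
    """indexed: 2-D array of color indices 0..n_colors-1; tau: region size threshold."""
    alpha = np.zeros(n_colors, dtype=np.int64)   # coherent pixel counts per color
    beta = np.zeros(n_colors, dtype=np.int64)    # incoherent pixel counts per color
    structure = np.ones((3, 3), dtype=bool)      # 8-connected regions (assumption)
    for color in range(n_colors):
        labeled, n_regions = label(indexed == color, structure=structure)
        for region in range(1, n_regions + 1):
            size = int(np.sum(labeled == region))
            if size > tau:
                alpha[color] += size
            else:
                beta[color] += size
    return alpha, beta

def ccv_similarity(ccv1, ccv2, n_pixels):
    # equations (2.19) and (2.20)
    diff = np.abs(ccv1[0] - ccv2[0]).sum() + np.abs(ccv1[1] - ccv2[1]).sum()
    return 1.0 - diff / (n_pixels * 2.0)
```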

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.


Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task, in order to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in chapter 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
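The thesis uses MATLAB's SVM implementation; as a rough illustration of the training and scoring steps, a scikit-learn based sketch could look as follows. The linear kernel is an illustrative choice, not necessarily the kernel used in the thesis.

```python
# Sketch: training a binary SVM and retrieving labels plus certainty scores.
import numpy as np
from sklearn.svm import SVC

def train_svm(features, labels):
    """features: (n_samples, n_features) array; labels: 1 (good/salient) or 0."""
    model = SVC(kernel="linear")
    model.fit(features, labels)
    return model

def classify_with_scores(model, features):
    labels = model.predict(features)
    scores = model.decision_function(features)   # signed distance to the hyperplane
    return labels, scores
```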

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measurement is done based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
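A sketch of this greedy clustering step is shown below; it assumes a full symmetric similarity matrix (the thesis stores only the upper triangle) and the 87% threshold described above.

```python
# Sketch: cluster images by average similarity to existing clusters.
import numpy as np

def cluster_by_similarity(similarity, threshold=0.87):
    """similarity: symmetric (n, n) matrix of pairwise similarities in [0, 1]."""
    clusters = []                                   # each cluster is a list of image indices
    for i in range(similarity.shape[0]):
        placed = False
        for cluster in clusters:
            if np.mean([similarity[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])                    # start a new cluster
    return clusters
```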


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be retrieved. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false negatives/positives are achieved.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives)    (3.1)

which describes how many of the retrieved images actually should be retrieved,

Recall = true positives / (true positives + false negatives)    (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

Accuracy = (true positives + true negatives) / all samples    (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy: (a) parts of a quantity of images, (b) precision, (c) recall, (d) accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can also be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
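The measures can be computed directly from binary ground-truth and predicted labels, for example as in the following sketch.

```python
# Sketch: precision, recall and accuracy from equations (3.1)-(3.3).
import numpy as np

def evaluate(predicted, wanted):
    """predicted, wanted: boolean arrays of equal length."""
    predicted = np.asarray(predicted, dtype=bool)
    wanted = np.asarray(wanted, dtype=bool)
    tp = np.sum(predicted & wanted)      # true positives
    fp = np.sum(predicted & ~wanted)     # false positives
    fn = np.sum(~predicted & wanted)     # false negatives
    tn = np.sum(~predicted & ~wanted)    # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / predicted.size
    return precision, recall, accuracy
```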

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.


Figure 3.3: An image and examples of degraded versions of it: (a) original image, (b) brightened and Gaussian blurred, (c) motion blurred, (d) darkened with added salt and pepper noise. The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.
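As an illustration of the labelling procedure, the sketch below degrades a grayscale image and assigns a class label from its SSIM value against the original; the particular degradation parameters are arbitrary examples, not the randomized settings used in the thesis.

```python
# Sketch: degrade an image and label it good/bad from its SSIM value.
from skimage.filters import gaussian
from skimage.metrics import structural_similarity
from skimage.util import random_noise

def degrade_and_label(image, ssim_threshold=0.65):
    """image: grayscale float image with values in [0, 1]. Returns (degraded, label)."""
    degraded = gaussian(image, sigma=2)                # example blur alteration
    degraded = random_noise(degraded, mode='s&p')      # example salt and pepper noise
    ssim = structural_similarity(image, degraded, data_range=1.0)
    label = 'good' if ssim > ssim_threshold else 'bad'
    return degraded, label
```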

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of its images, and the quality evaluation set consists of the content evaluation set and modified versions of its images. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly, as described in section 3.5, and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images    Salient object
0.6951                       cat
0.7288                       airplane
0.6935                       umbrella
0.6821                       handbag
0.6902                       motorbike


Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


Figure 4.1: ROC curves for the quality classifications: (a) HOG features, (b) features extracted from the DCT domain, (c) features extracted from a CNN. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


Figure 4.2: ROC curves for the content classifications: (a) HOG features, (b) features extracted from the DCT domain, (c) features extracted from a CNN. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


Figure 4.4: Color coherence vectors of the images in figure 4.3: (a) color coherence vector of image 4.3a, (b) of image 4.3b, (c) of image 4.3c. The x-axis is the indexed colors and the y-axis is the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat: (a) the resulting similarity matrix, (b) a binary similarity matrix showing images that originate from the same image, with the true similar as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is retrieved when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b. Ideal results are shown in a ROC curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because the method does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high, which implies that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the model is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG features are often used for content classification tasks and perform well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, the threshold can be changed if one type of error is preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous context, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence matrix can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but holds room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Ap-plications Computer science technology and applications Nova Science Publish-ers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield K Simonyan A Vedaldi and A Zisserman Return of the devil in thedetails Delving deep into convolutional nets In British Machine Vision Conference2014 Cited on pages 15 and 18

[4] Dan C Ciresan Ueli Meier Jonathan Masci Luca M Gambardella and Juumlr-gen Schmidhuber Flexible high performance convolutional neural networks forimage classification In Proceedings of the Twenty-Second International JointConference on Artificial Intelligence - Volume Volume Two IJCAIrsquo11 pages1237ndash1242 AAAI Press 2011 ISBN 978-1-57735-514-4 doi 105591978-1-57735-516-8IJCAI11-210 URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210 Cited on page 13

[5] RL Delanoy Machine learning apparatus and method for image searching Au-gust 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzengand Trevor Darrell Decaf A deep convolutional activation feature for generic visualrecognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognitionbased histogram of oriented gradient and normalized cross correlation In 2016 In-ternational Conference on Information Technology for Organizations Development

43

44 Bibliography

(IT4OD) pages 1ndash5 March 2016 doi 101109IT4OD20167479316 Cited onpage 9

[9] Yann LeCun Koray Kavukcuoglu and Cleacutement Farabet Convolutional networksand applications in vision In ISCAS pages 253ndash256 IEEE 2010 ISBN 978-1-4244-5309-2 URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10 Cited on page 15

[10] Tsung-Yi Lin Michael Maire Serge J Belongie Lubomir D Bourdev Ross BGirshick James Hays Pietro Perona Deva Ramanan Piotr Dollaacuter and C LawrenceZitnick Microsoft COCO common objects in context CoRR abs14050312 2014URL httparxivorgabs14050312 Cited on page 3

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A Nielsen Neural Networks and Deep Learning Determination Press2015 Cited on page 14

[16] Parul Parashar and Er Harish Kundra Comparison of various image classificationmethods International Journal of Advances in Science and Technology (IJAST) 2(1) 2014 Cited on page 19

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with apache spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] MA Saad, AC Bovik, and C Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11, and 19.


[20] F Suard, A Rakotomamonjy, and A Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A C Bovik, H R Sheikh, and E P Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


2 Related theory

This chapter covers the related theory which supports the methods used in this thesis. Unless anything else is specified, the content of a paragraph is supported by the references given at the end of the paragraph, without case-specific modifications.

21 Available data

The data used is the COCO - Common Objects in Context [10] data set, which contains 91 different object categories such as food, animals and vehicles. It contains many non-iconic images of the objects in their natural environment, as opposed to iconic images which typically have a large object in a canonical perspective centered in the image. Non-iconic images contain more contextual information and show the object in non-canonical perspectives. Figure 21 shows examples of iconic and non-iconic images from the COCO data set.

(a) Iconic image (b) Non-iconic image (c) Non-iconic image

Figure 21 Examples of images from the data set containing the object cat (a) isan iconic image while (b) and (c) are non-iconic



22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make predictions about new data. It is based on creating models from observations, called training data, for data-driven decision making. The concept is illustrated by a flow chart in figure 22, where the vertical part of the flow is called the training part and the horizontal part is called the evaluation part [18].


Figure 22 The concept of machine learning where a machine learning algorithmcreates a decision model from training data The model is then used to make predic-tions about new data (Flow chart drawn according to [18])

There are different types of machine learning models; this report focuses on the one called supervised learning. In supervised learning the input training data have corresponding outputs, and the goal is to find a function or model that correctly maps the inputs to the outputs. That is in contrast to unsupervised learning, for which the input data has no corresponding output. The goal of unsupervised learning is to model the underlying structure or distribution of the input data to create corresponding outputs [18]. A common use of supervised machine learning is classification, where the observations are labelled with classes and the prediction outputs are different classes. It can be described in a simple manner as finding the function f that fulfills Y = f(X), where X contains the input observations and Y the corresponding output classes. With X and Y as matrices the description becomes as follows:


$$\begin{bmatrix} \text{class}(\text{observation}_1) \\ \text{class}(\text{observation}_2) \\ \vdots \end{bmatrix} = f\left(\begin{bmatrix} \text{observation}_1 \\ \text{observation}_2 \\ \vdots \end{bmatrix}\right) \qquad (21)$$

Y is a column vector where each row contains the class of the corresponding row in X. Each row in X corresponds to an observation, which is represented by the values, also called features, in its columns. These values can be measurements such as weight and height, but when it comes to images the compilation of the values in X becomes more complex [14]. Raw pixel values can be used as features for images, but for other than simple cases the representation is not descriptive enough, especially when working with natural images. The aim is to represent an image by distinctive attributes that distinguish the observations of one class from the other. Therefore an important step when using machine learning on images is feature extraction [7]. In figure 22 the feature extraction is a big part of the first step in both the training part and the evaluation part. There are many methods for feature extraction; this thesis covers three of them: histogram of oriented gradients in section 24, features extracted from the discrete cosine domain in section 25 and features extracted from a pre-trained convolutional neural network in section 26.

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model. By learning from provided examples - the training data - the model finds a function that couples input data to the correct output. The output for novel data can then be predicted by applying the retrieved function. SVM is often used for classification problems, for which the correct output is the class the data belongs to. The model works by creating a hyperplane that separates data points from one class from those from the other class, with a margin as large as possible. The margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors, which give the model its name, are the data points closest to the hyperplane and therefore determine the margin. The margin and the support vectors are illustrated in figure 23.


Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane; in that case a soft margin can be used, which means that the hyperplane separates many but not all data points. The data for training is a set of vectors $x_j$ along with their classes $y_j$, where $j$ is a training instance, $j = 1, 2, \ldots, l$ and $l$ is the number of training instances. The hyperplane can be created in a higher dimensional space if separating the classes requires it. The hyperplane is described by $w^T\phi(x_j) + w_0 = 0$, where $\phi$ is a function that maps $x_j$ to a higher-dimensional space and $w$ is the normal to the hyperplane. The SVM classifier satisfies the following conditions:

$$\begin{cases} w^T\phi(x_j) + w_0 \ge +1 & \text{if } y_j = +1 \\ w^T\phi(x_j) + w_0 \le -1 & \text{if } y_j = -1 \end{cases} \qquad j = 1, 2, \ldots, l \qquad (22)$$

and classifies according to the following decision function

$$y(x) = \operatorname{sign}\left[w^T\phi(x) + w_0\right] \qquad (23)$$

where $\phi$ non-linearly maps $x$ to the high-dimensional feature space. A linear separation is then performed in the feature space, which is illustrated in figure 24.


Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

$$K(x, x') = \phi(x)^T\phi(x') \qquad (24)$$

which is equal to the inner product of the two vectors $x$ and $x'$ in the feature space. Using kernels, a new non-linear decision function is retrieved:

$$y(x) = \operatorname{sign}\left[\sum_{j=1}^{l} y_j K(x, x_j) + w_0\right] \qquad (25)$$

which corresponds to the form of the hyperplane in the input space [2] [11]
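As an illustration of how such a kernel SVM can be used in practice, the sketch below trains a classifier on feature vectors and predicts class labels and decision scores for new data. It uses scikit-learn's SVC with an RBF kernel as a stand-in for the MATLAB SVM implementation used in this thesis; the random feature matrix and all parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Toy feature matrix X (one row per image) and labels y (+1 / -1).
# In this thesis the rows would be HOG, DCT-domain or CNN feature vectors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=200))

# The RBF kernel plays the role of K(x, x') = phi(x)^T phi(x') in equation (24).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

X_new = rng.normal(size=(5, 64))
print(clf.predict(X_new))            # class labels, cf. equations (23) and (25)
print(clf.decision_function(X_new))  # signed distances, usable as certainty scores
```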

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method for machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap; each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell the gradients in the x and y directions ($G_x$ and $G_y$) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 25.

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

$$r = \sqrt{G_x^2 + G_y^2} \qquad (26)$$

$$\theta = \arctan\left(\frac{G_y}{G_x}\right) \qquad (27)$$

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°-180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20] [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 26.

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°-180°, which makes the rose plots symmetric [12].
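A minimal sketch of extracting a HOG feature vector from an image is shown below, using scikit-image's hog function; the file path and the cell, block and bin sizes are illustrative assumptions and not necessarily the settings used in this work.

```python
from skimage import io, color
from skimage.feature import hog

# Load an image and convert to grey scale ("path/to/image.jpg" is a placeholder).
image = color.rgb2gray(io.imread("path/to/image.jpg"))

# 9 orientation bins over 0-180 degrees (unsigned gradients), cells of 8x8 pixels,
# blocks of 2x2 cells with 50% overlap and block-wise normalisation.
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm="L2-Hys",
               feature_vector=True)

print(features.shape)  # one concatenated feature vector for the whole image
```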

25 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

$$B_{pq} = \alpha_p\alpha_q \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} I_{mn}\cos\left(\frac{\pi(2m+1)p}{2M}\right)\cos\left(\frac{\pi(2n+1)q}{2N}\right) \qquad (28)$$

where $0 \le p \le M-1$, $0 \le q \le N-1$,

$$\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases} \qquad (29)$$

and

$$\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases} \qquad (210)$$

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]
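A small sketch of computing 2D DCT coefficients block-wise over an image is given below, using SciPy's dctn with orthonormal scaling, which corresponds to the α_p, α_q factors in equations (28)-(210); the block size, step and the random test image are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dctn

def block_dct(image, block_size=5, step=3):
    """2D DCT of overlapping blocks; step = block_size - 2 gives a
    two-pixel overlap (the sizes here are assumptions)."""
    blocks = []
    h, w = image.shape
    for r in range(0, h - block_size + 1, step):
        for c in range(0, w - block_size + 1, step):
            block = image[r:r + block_size, c:c + block_size]
            # norm="ortho" matches the alpha_p, alpha_q scaling in (28)-(210)
            blocks.append(dctn(block, norm="ortho"))
    return np.array(blocks)

image = np.random.rand(64, 64)   # placeholder grey-scale image
coeffs = block_dct(image)
print(coeffs.shape)              # (number of blocks, 5, 5)
```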

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19], which represent image structure and whose statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (28). Then a generalized Gaussian density model, shown in equation (211), is introduced and used to approximate the distribution of DCT image coefficients:

$$f(x|\alpha, \beta, \gamma) = \alpha \exp\left(-(\beta|x - \mu|)^{\gamma}\right) \qquad (211)$$

where x is the multivariate random variable, μ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

$$\alpha = \frac{\beta\gamma}{2\Gamma(1/\gamma)} \qquad (212)$$

$$\beta = \frac{1}{\sigma}\sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}} \qquad (213)$$

where σ is the standard deviation and Γ is the gamma function given by

$$\Gamma(z) = \int_{0}^{\infty} t^{z-1}\exp(-t)\,dt \qquad (214)$$

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 27a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 27b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 27c.


(a) A 5 × 5 block in an image on which the parameters γ and ζ are calculated (b) A 5 × 5 block split into radial frequency sub-bands a on which Ra is calculated (c) A 5 × 5 block split into oriented sub-bands b on which ζb is calculated

Figure 27 Illustrations of the DCT components in a block which an image is split into and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

• The generalized Gaussian model shape parameter γ, seen in equation (211), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 28 illustrates the generalized Gaussian distribution in equation (211) for different values of the parameter γ.

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (211) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ:

$$\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)} - 1} \qquad (215)$$

where X is a random variable representing the histogrammed DCT coefficients, σ_|X| and μ_|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (214) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

$$E_a = \sigma_a^2 \qquad (216)$$

The average energy up to band n is described by

$$E_{j<a} = \frac{1}{n-1}\sum_{j<a} E_j \qquad (217)$$

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the $\sigma_a^2$ from the fit. Using the two parameters $E_a$ and $E_{j<a}$, a ratio $R_a$ between the components and the sum of the components is formed according to

$$R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (218)$$

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios R1 and R2 is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of Ra and the mean of Ra.

• The orientation model-based feature ζ, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζb is defined according to equation (215) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζb from all the blocks in an image is calculated. ζb and the variance of ζb are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the 10th highest percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2
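The sketch below illustrates, under stated assumptions, how the GGD shape parameter γ and the frequency variation coefficient ζ could be estimated for one block of DCT coefficients: γ by moment matching over the grid 0.3-10 mentioned above, and ζ directly from equation (215). It is a simplified stand-in for the fitting procedure in [19]; the helper names and the random block are placeholders.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def fit_ggd_shape(coeffs, grid=np.arange(0.3, 10.0, 0.001)):
    """Estimate the GGD shape parameter by moment matching: for shape g the
    ratio E[x^2]/E[|x|]^2 equals Gamma(1/g)*Gamma(3/g)/Gamma(2/g)^2."""
    x = coeffs.ravel() - np.mean(coeffs)
    rho_hat = np.mean(x ** 2) / (np.mean(np.abs(x)) ** 2 + 1e-12)
    rho = gamma_fn(1.0 / grid) * gamma_fn(3.0 / grid) / gamma_fn(2.0 / grid) ** 2
    return grid[np.argmin(np.abs(rho - rho_hat))]

def frequency_variation(coeffs):
    """zeta = sigma_|X| / mu_|X|, equation (215), computed empirically."""
    a = np.abs(coeffs.ravel())
    return np.std(a) / (np.mean(a) + 1e-12)

block = np.random.randn(5, 5)   # placeholder DCT coefficient block
print(fit_ggd_shape(block), frequency_variation(block))
```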

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29


Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210


Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results, outperforming many existing methods and clustering with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
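A hedged sketch of extracting such a descriptor with a pre-trained network is shown below. Since MatConvNet's VGG-F model is not assumed to be available here, an ImageNet-pretrained network from torchvision is used as a stand-in; the file path is a placeholder and the preprocessing differs from the exact VGG-F pipeline.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Any ImageNet-pretrained classification network can supply a fixed-length descriptor.
model = models.vgg16(weights="IMAGENET1K_V1")
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("path/to/image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(image)                      # 1000-dimensional output
    descriptor = torch.softmax(logits, dim=1)  # soft-max values used as features

print(descriptor.shape)  # torch.Size([1, 1000])
```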


27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

$$\langle(\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n)\rangle, \quad j = 1, 2, \ldots, n$$

where $\alpha_j$ is the number of coherent pixels, $\beta_j$ is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I' is then given by the following parameters:

$$\text{differentiating pixels} = \sum_{j=1}^{n} |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \qquad (219)$$

$$\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (220)$$

[17]
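A minimal sketch of computing color coherence vectors and the similarity measure in equations (219)-(220) is given below; the connected-component counting follows the definition above, while the number of colors and the coherence threshold are taken from the values used later in the method chapter, and the random indexed images are placeholders.

```python
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors, tau):
    """CCV: for each indexed color, count pixels in connected regions of at
    least tau pixels (coherent) and the remaining pixels (incoherent)."""
    alpha = np.zeros(n_colors, dtype=np.int64)
    beta = np.zeros(n_colors, dtype=np.int64)
    for c in range(n_colors):
        labels, n = label(indexed == c)
        for region in range(1, n + 1):
            size = int(np.sum(labels == region))
            if size >= tau:
                alpha[c] += size
            else:
                beta[c] += size
    return alpha, beta

def ccv_similarity(ccv1, ccv2, n_pixels):
    """Equations (219)-(220): similarity from the pairwise CCV differences."""
    (a1, b1), (a2, b2) = ccv1, ccv2
    diff = np.sum(np.abs(a1 - a2) + np.abs(b1 - b2))
    return 1.0 - diff / (2.0 * n_pixels)

# Placeholder indexed images with 128 colors and a 2500-pixel coherence threshold.
img1 = np.random.randint(0, 128, (500, 500))
img2 = np.random.randint(0, 128, (500, 500))
ccv1 = color_coherence_vector(img1, 128, 2500)
ccv2 = color_coherence_vector(img2, 128, 2500)
print(ccv_similarity(ccv1, ccv2, img1.size))
```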

3 Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis




Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part when the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine (DCT) domain [21] and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task in order to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]
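A small sketch of how such a per-class certainty score matrix can be obtained is shown below, again with scikit-learn's SVC as a stand-in for the MATLAB predictor; the synthetic data and the way the two score columns are formed from the decision function are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic training data: 100 observations with 32 features and binary labels.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 32)), rng.integers(0, 2, 100)
model = SVC(kernel="rbf", gamma="scale").fit(X, y)

d = model.decision_function(X[:5])   # signed distance to the hyperplane
scores = np.column_stack((-d, d))    # column 0: negative class, column 1: positive class
labels = model.predict(X[:5])
print(labels, scores)
```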

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
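The sketch below shows one way the clustering and the per-cluster retrieval described above could be implemented as a greedy pass over a pairwise similarity matrix; the exact procedure used in the thesis may differ, and the similarity matrix, the scores and the 87% threshold are used here only for illustration.

```python
import numpy as np

def cluster_by_similarity(sim, threshold=0.87):
    """Greedy clustering: an image joins the first cluster to which its average
    similarity exceeds the threshold, otherwise it starts a new cluster."""
    clusters = []
    for i in range(sim.shape[0]):
        placed = False
        for cluster in clusters:
            if np.mean([sim[min(i, j), max(i, j)] for j in cluster]) > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

def retrieve_one_per_cluster(clusters, scores):
    """Keep the image with the highest combined quality + content score."""
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]

sim = np.triu(np.random.rand(6, 6), k=1)   # placeholder upper-triangular similarity matrix
scores = np.random.rand(6)                  # placeholder combined classification scores
clusters = cluster_by_similarity(sim)
print(retrieve_one_per_cluster(clusters, scores))
```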


34 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well it conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way true/false negatives/positives are achieved.

All results will be evaluated using the measures precision recall and accuracy whichare defined as

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (31)$$

which describes how many of the retrieved images actually should have been retrieved,

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (32)$$

which describes how many of the images that should be retrieved actually are retrieved, and

$$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \qquad (33)$$

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 32.


(a) Parts of a quantity of images (b) Precision (c) Recall (d) Accuracy

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)
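For completeness, a direct translation of equations (31)-(33) into code is given below; the counts in the example are made up.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, total):
    return (tp + tn) / total

# Example with made-up counts: 80 true positives, 20 false positives,
# 10 false negatives and 890 true negatives out of 1000 images.
tp, fp, fn, tn = 80, 20, 10, 890
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, tp + fp + fn + tn))
```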

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 33.


(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt-and-pepper noise

Figure 33 An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.
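A hedged sketch of one random degradation followed by the SSIM check that assigns the good/bad label is given below, using scikit-image; the file path, the degradation magnitudes and the probability of applying each group are assumptions, while the 0.65 threshold follows the text above.

```python
import numpy as np
from skimage import io, color, filters, util
from skimage.metrics import structural_similarity

original = color.rgb2gray(io.imread("path/to/image.jpg"))  # placeholder path

rng = np.random.default_rng(0)
degraded = original.copy()
if rng.random() < 0.5:                 # light group: brighten or darken
    degraded = np.clip(degraded + rng.uniform(-0.4, 0.4), 0, 1)
if rng.random() < 0.5:                 # noise/blur group: at most one alteration
    if rng.random() < 0.5:
        degraded = util.random_noise(degraded, mode="s&p", amount=rng.uniform(0.01, 0.1))
    else:
        degraded = filters.gaussian(degraded, sigma=rng.uniform(0.5, 3.0))

ssim = structural_similarity(original, degraded, data_range=1.0)
label = "good" if ssim > 0.65 else "bad"
print(ssim, label)
```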

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000 but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 34. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4 Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 35 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 41. The results of the quality classification are shown in table 42. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 41. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 41 The proportion of good images for the different salient objects

Proportion good images    Salient object
0.6951                    cat
0.7288                    airplane
0.6935                    umbrella
0.6821                    handbag
0.6902                    motorbike



Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method         Precision  Recall  Accuracy  Salient object
HOG                               0.8399     0.9390  0.8332    cat
HOG                               0.8544     0.9799  0.8636    airplane
HOG                               0.8018     0.9702  0.8130    umbrella
HOG                               0.8333     0.9442  0.8332    handbag
HOG                               0.8506     0.9236  0.8353    motorbike
HOG                               0.8360     0.9514  0.8357    average
Extracted from the DCT domain     0.9196     0.9116  0.8832    cat
Extracted from the DCT domain     0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain     0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain     0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain     0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain     0.9298     0.9347  0.9054    average
Features extracted from a CNN     0.6951     1       0.6951    cat
Features extracted from a CNN     0.7288     1       0.7288    airplane
Features extracted from a CNN     0.6935     1       0.6935    umbrella
Features extracted from a CNN     0.6821     1       0.6821    handbag
Features extracted from a CNN     0.6902     1       0.6902    motorbike
Features extracted from a CNN     0.6979     1       0.6979    average


(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


42 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 43. The results are visualized using ROC-curves, shown in figure 42. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method         Precision  Recall  Accuracy  Salient object
HOG                               0.6631     0.6717  0.6652    cat
HOG                               0.8645     0.8043  0.8391    airplane
HOG                               0.5959     0.5739  0.5924    umbrella
HOG                               0.6759     0.6348  0.6652    handbag
HOG                               0.5758     0.7348  0.5967    motorbike
HOG                               0.6750     0.6839  0.6717    average
Extracted from the DCT domain     0.6253     0.6239  0.6250    cat
Extracted from the DCT domain     0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain     0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain     0.6256     0.5630  0.6130    handbag
Extracted from the DCT domain     0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain     0.6559     0.6370  0.6441    average
Features extracted from a CNN     0.9038     0.7761  0.8467    cat
Features extracted from a CNN     1          0.6935  0.8467    airplane
Features extracted from a CNN     0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN     0.7560     0.6804  0.7304    handbag
Features extracted from a CNN     0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN     0.8799     0.7635  0.8256    average


(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44


(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices


Table 44 Results from the evaluation of the retrieval part for different categories assalient

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

44 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 41. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 42. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 45.

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 31-33 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in figure 41c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it often is desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 41b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figure 41b and 41c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 41a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41


512 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 42, as the different colored curves representing different salient objects are differing. Figure 42b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since it uses very few features describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since it is able to differ somewhat between salient and non-salient images only described by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates retrieved from the ROC curves in figure 4.2.

513 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, since the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while keeping the salient object present, does not change the color coherence vector as much as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy; those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where near-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

514 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of those for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved unmodified images count as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.


52 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple; if the thesis work had been of a larger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in the objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

53 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context typically prevails during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the positions of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is very likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically captured flight data does not have the problem of varying context to the same extent, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Ap-plications Computer science technology and applications Nova Science Publish-ers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield K Simonyan A Vedaldi and A Zisserman Return of the devil in thedetails Delving deep into convolutional nets In British Machine Vision Conference2014 Cited on pages 15 and 18

[4] Dan C Ciresan Ueli Meier Jonathan Masci Luca M Gambardella and Juumlr-gen Schmidhuber Flexible high performance convolutional neural networks forimage classification In Proceedings of the Twenty-Second International JointConference on Artificial Intelligence - Volume Volume Two IJCAIrsquo11 pages1237ndash1242 AAAI Press 2011 ISBN 978-1-57735-514-4 doi 105591978-1-57735-516-8IJCAI11-210 URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210 Cited on page 13

[5] RL Delanoy Machine learning apparatus and method for image searching Au-gust 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzengand Trevor Darrell Decaf A deep convolutional activation feature for generic visualrecognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognitionbased histogram of oriented gradient and normalized cross correlation In 2016 In-ternational Conference on Information Technology for Organizations Development



(IT4OD) pages 1ndash5 March 2016 doi 101109IT4OD20167479316 Cited onpage 9

[9] Yann LeCun Koray Kavukcuoglu and Cleacutement Farabet Convolutional networksand applications in vision In ISCAS pages 253ndash256 IEEE 2010 ISBN 978-1-4244-5309-2 URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10 Cited on page 15

[10] Tsung-Yi Lin Michael Maire Serge J Belongie Lubomir D Bourdev Ross BGirshick James Hays Pietro Perona Deva Ramanan Piotr Dollaacuter and C LawrenceZitnick Microsoft COCO common objects in context CoRR abs14050312 2014URL httparxivorgabs14050312 Cited on page 3

[11] MathWorks Support vector machines for binary classifica-tion URL httpssemathworkscomhelpstatssupport-vector-machines-for-binary-classificationhtmlCited on pages 6 7 and 19

[12] MathWorks Extracthogfeatures URL httpssemathworkscomhelpvisionrefextracthogfeatureshtml Cited on page 9

[13] MathWorks Discrete cosine transform URL httpssemathworkscomhelpimagesdiscrete-cosine-transformhtml Cited onpage 10

[14] MathWorks Supervised learning workflow and algorithms URL httpssemathworkscomhelpstatssupervised-learning-machine-learning-workflow-and-algorithmshtmls_tid=conf_addres_DA_eb Cited on page 5

[15] Michael A Nielsen Neural Networks and Deep Learning Determination Press2015 Cited on page 14

[16] Parul Parashar and Er Harish Kundra Comparison of various image classificationmethods International Journal of Advances in Science and Technology (IJAST) 2(1) 2014 Cited on page 19

[17] Greg Pass Ramin Zabih and Justin Miller Comparing images using color coher-ence vectors In Proceedings of the Fourth ACM International Conference on Multi-media MULTIMEDIA rsquo96 pages 65ndash73 New York NY USA 1996 ACM ISBN0-89791-871-1 doi 101145244130244148 URL httpdoiacmorg101145244130244148 Cited on pages 16 and 19

[18] Srini Penchikala Big data processing with apache spark - part 4 Spark ma-chine learning May 2016 URL httpswwwinfoqcomarticlesapache-spark-machine-learning Cited on page 4

[19] MA Saad AC Bovik and C Charrier Blind image quality assessment A naturalscene statistics approach in the dct domain IEEE Transactions on image processing21(8) August 2008 Cited on pages 10 11 and 19


[20] F Suard A Rakotomamonjy and A Bensrhair Pedestrian detection using infraredimages and histograms of oriented gradients In in IEEE Conference on IntelligentVehicles pages 206ndash212 2006 Cited on pages 9 18 and 19

[21] Zhou Wang A C Bovik H R Sheikh and E P Simoncelli Image quality as-sessment From error visibility to structural similarity Trans Img Proc 13(4)600ndash612 April 2004 ISSN 1057-7149 doi 101109TIP2003819861 URLhttpdxdoiorg101109TIP2003819861 Cited on pages 18and 22


22 Machine learning

Machine learning is the concept of learning from large sets of existing data to make predictions about new data. It is based on creating models from observations, called training data, for data-driven decision making. The concept is illustrated by a flow chart in figure 2.2, where the vertical part of the flow is called the training part and the horizontal part is called the evaluation part [18].

Figure 2.2 The concept of machine learning: a machine learning algorithm creates a decision model from training data. The model is then used to make predictions about new data. (Flow chart drawn according to [18])

There are different types of machine learning models; this report focuses on the one called supervised learning. In supervised learning the input training data have corresponding outputs, and the goal is to find a function or model that correctly maps the inputs to the outputs. That is in contrast to unsupervised learning, for which the input data has no corresponding output. The goal of unsupervised learning is to model the underlying structure or distribution of the input data to create corresponding outputs [18]. A common use of supervised machine learning is classification, where the observations are labelled with classes and the prediction outputs are different classes. It can be described in a simple manner as finding the function f that fulfills Y = f(X), where X contains the input observations and Y the corresponding output classes. With X and Y as matrices the description becomes as follows:

class(observation_1)          observation_1
class(observation_2)   =  f(  observation_2  )     (2.1)
        ⋮                          ⋮

Y is a column vector where each row contains the class of the corresponding row in X. Each row in X corresponds to an observation, which is represented by the values, also called features, in its columns. These values can be measurements such as weight and height, but when it comes to images the compilation of the values in X becomes more complex [14]. Raw pixel values can be used as features for images, but for other than simple cases the representation is not descriptive enough, especially when working with natural images. The aim is to represent an image by distinctive attributes that distinguish the observations of one class from the other. Therefore, an important step when using machine learning on images is feature extraction [7]. In figure 2.2 the feature extraction is a big part of the first step in both the training part and the evaluation part. There are many methods for feature extraction; this thesis covers three of them: histogram of oriented gradients in section 2.4, features extracted from the discrete cosine domain in section 2.5, and features extracted from a pre-trained convolutional neural network in section 2.6.

23 Support Vector Machines

Support vector machines (SVM) is a form of supervised machine learning model. By learning from provided examples (the training data) the model finds a function that couples input data to the correct output. The output for novel data can then be predicted by applying the retrieved function. SVM is often used for classification problems, for which the correct output is the class the data belongs to. The model works by creating a hyperplane that separates the data points of one class from those of the other class with as high a margin as possible. The margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors, which give the model its name, are the data points closest to the hyperplane and therefore determine the margin. The margin and the support vectors are illustrated in figure 2.3.


Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane; in that case a soft margin can be used, which means that the hyperplane separates many but not all data points. The data for training is a set of vectors x_j along with their classes y_j, where j is a training instance, j = 1, 2, ..., l, and l is the number of training instances. The hyperplane can be created in a higher-dimensional space if separating the classes requires it. The hyperplane is described by w^T φ(x_j) + w_0 = 0, where φ is a function that maps x_j to a higher-dimensional space and w is the normal to the hyperplane. The SVM classifier satisfies the following conditions:

w^T φ(x_j) + w_0 ≥ +1  if y_j = +1
w^T φ(x_j) + w_0 ≤ −1  if y_j = −1,   j = 1, 2, ..., l     (2.2)

and classifies according to the following decision function:

y(x) = sign[ w^T φ(x) + w_0 ]     (2.3)

where φ non-linearly maps x to the high-dimensional feature space. A linear separation is then performed in the feature space, which is illustrated in figure 2.4.


Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional, performing computations in that space is computationally heavy. Therefore a kernel function is introduced, which is used to map the original non-linear observations into the higher-dimensional space more efficiently. The kernel function can be expressed as a dot product in a high-dimensional space. Through the kernel function, all computations are performed in the low-dimensional input space. The kernel function is

K(x, x′) = φ(x)^T φ(x′)     (2.4)

which is equal to the inner product of the two vectors x and x′ in the feature space. Using kernels, a new non-linear decision function is retrieved:

y(x) = sign( Σ_{j=1}^{l} y_j K(x, x_j) + w_0 )     (2.5)

which corresponds to the form of the hyperplane in the input space [2] [11].
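As an illustration of how such a classifier is used in practice, the sketch below trains a kernel SVM on toy feature vectors with scikit-learn, which here stands in for the MATLAB SVM implementation used later in the thesis; the data, labels and parameters are arbitrary examples, not values from the thesis.

```python
# Minimal sketch (not the thesis implementation): training a kernel SVM on
# pre-computed feature vectors and reading out both labels and decision scores.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))                   # 200 observations, 64 features each
y_train = np.sign(X_train[:, 0] + 0.5 * rng.normal(size=200))  # toy labels in {-1, +1}

clf = SVC(kernel="rbf", C=1.0)                         # RBF kernel plays the role of K(x, x')
clf.fit(X_train, y_train)

X_new = rng.normal(size=(5, 64))
labels = clf.predict(X_new)                            # class labels, the sign(...) of equation (2.5)
scores = clf.decision_function(X_new)                  # signed distances to the hyperplane
print(labels, scores)
```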

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method in machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap; each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell the gradients in the x and y directions (Gx and Gy) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r = √(Gx² + Gy²)     (2.6)

θ = arctan(Gy / Gx)     (2.7)

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0° and 180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20] [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

(a) Image with rose plots (b) Zoomed in

Figure 2.6 The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0° and 180°, which makes the rose plots symmetric [12].
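A minimal sketch of extracting a HOG descriptor with scikit-image is shown below; the cell size, block size and number of orientation bins are illustrative choices and not necessarily those used in the thesis, and the example image is just a built-in test image.

```python
# Minimal sketch: HOG feature vector of a grayscale image with scikit-image.
# Parameters are illustrative, not the thesis configuration.
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())      # any grayscale image
features = hog(
    image,
    orientations=9,                           # histogram bins spanning 0-180 degrees
    pixels_per_cell=(8, 8),                   # cell size
    cells_per_block=(2, 2),                   # cells per block; blocks overlap
    block_norm="L2-Hys",                      # per-block normalization
)
print(features.shape)                         # concatenated histograms of all cells
```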

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

B_pq = α_p α_q Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} I_mn cos( π(2m+1)p / (2M) ) cos( π(2n+1)q / (2N) )     (2.8)

where 0 ≤ p ≤ M−1, 0 ≤ q ≤ N−1,

α_p = 1/√M        for p = 0
α_p = √(2/M)      for 1 ≤ p ≤ M−1     (2.9)

and

α_q = 1/√N        for q = 0
α_q = √(2/N)      for 1 ≤ q ≤ N−1     (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two-pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of DCT image coefficients:

f(x | α, β, γ) = α exp( −(β |x − μ|)^γ )     (2.11)

where x is the multivariate random variable, μ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

α = βγ / (2Γ(1/γ))     (2.12)

β = (1/σ) √( Γ(3/γ) / Γ(1/γ) )     (2.13)

where σ is the standard deviation and Γ is the gamma function given by

Γ(z) = ∫_0^∞ t^(z−1) exp(−t) dt     (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.


(a) A 5 × 5 block in an image on which the parameters γ and ζ are calculated

(b) A 5 × 5 block split into radial frequency sub-bands a, on which Ra is calculated

(c) A 5 × 5 block split into oriented sub-bands b, on which ζb is calculated

Figure 2.7 Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ,

ζ = σ_|X| / μ_|X| = √( Γ(1/γ) Γ(3/γ) / Γ²(2/γ) − 1 )     (2.15)

where X is a random variable representing the histogrammed DCT coefficients, σ_|X| and μ_|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = σ_a²     (2.16)

The average energy up to band n is described by

E_{j<a} = (1 / (n − 1)) Σ_{j<a} E_j     (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the σ_a² from the fit. Using the two parameters E_a and E_{j<a}, a ratio R_a is formed between the difference of the components and the sum of the components according to

R_a = |E_a − E_{j<a}| / (E_a + E_{j<a})     (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3, and the mean of the two resulting ratios R1 and R2 is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature ζ_b, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζ_b is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζ_b from all the blocks in an image is calculated; ζ_b and the variance of ζ_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the highest 10th percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is summarized by fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
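As a rough illustration of two of the steps above, the sketch below computes blockwise 2-D DCT coefficients and estimates the shape parameter γ by a moment-matching grid search over the range 0.3-10. The block size, grid resolution and function names are assumptions made for the example, not the exact procedure used in the thesis.

```python
# Simplified sketch: blockwise 2-D DCT plus a grid-search estimate of the
# generalized Gaussian shape parameter gamma (moment matching).
import numpy as np
from scipy.fft import dctn
from scipy.special import gamma as gamma_fn

def block_dct(image, block=5, step=3):
    """2-D DCT of overlapping block x block patches (two-pixel overlap for block=5, step=3)."""
    h, w = image.shape
    coeffs = []
    for r in range(0, h - block + 1, step):
        for c in range(0, w - block + 1, step):
            coeffs.append(dctn(image[r:r + block, c:c + block], norm="ortho"))
    return np.array(coeffs)

def estimate_shape(x, grid=np.linspace(0.3, 10, 500)):
    """Pick the gamma whose theoretical E[x^2] / E[|x|]^2 best matches the data."""
    x = x - np.mean(x)
    target = np.mean(x ** 2) / (np.mean(np.abs(x)) ** 2 + 1e-12)
    ratios = gamma_fn(1 / grid) * gamma_fn(3 / grid) / gamma_fn(2 / grid) ** 2
    return grid[np.argmin(np.abs(ratios - target))]

img = np.random.rand(64, 64)                          # stand-in for a grayscale image
blocks = block_dct(img)
gammas = [estimate_shape(b.ravel()[1:]) for b in blocks]   # drop the DC coefficient
print(np.percentile(gammas, 10), np.mean(gammas))          # pooled features for gamma
```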

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

Convolutional neural networks (CNN) are a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data, and each neuron in the output layer produces a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value, which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, in a simplified manner, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, shown in figure 2.10.


Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the fully connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, clustering with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) used here are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
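A sketch of this kind of feature extraction is shown below, using torchvision's pre-trained VGG-16 as a stand-in for the MatConvNet VGG-F network referenced above; the weights identifier assumes a recent torchvision version, the preprocessing constants are the standard ImageNet ones, and the file name is hypothetical.

```python
# Sketch: extracting a fixed-length descriptor from a pre-trained network.
# VGG-16 from torchvision stands in for the MatConvNet VGG-F model in the text.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16(weights="IMAGENET1K_V1")      # pre-trained on ImageNet
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
x = preprocess(image).unsqueeze(0)                 # add batch dimension

with torch.no_grad():
    descriptor = model(x).squeeze(0)               # 1000-dimensional output used as features
print(descriptor.shape)
```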


27 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

< (α_1, β_1), ..., (α_n, β_n) >,   j = 1, 2, ..., n

where α_j is the number of coherent pixels, β_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I′ is then given by the following parameters:

differentiating pixels = Σ_{j=1}^{n} ( |α_j − α′_j| + |β_j − β′_j| )     (2.19)

similarity = 1 − differentiating pixels / (all pixels · 2)     (2.20)

[17]
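A small sketch of computing a color coherence vector and the similarity of equations (2.19)-(2.20) is given below; the number of colors and the coherence threshold are placeholders, not the values used in the thesis.

```python
# Sketch: pixels of each quantized color are split into coherent/incoherent counts
# with connected-component labelling; two vectors are then compared as in (2.19)-(2.20).
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors, tau):
    """indexed: 2-D array of color indices in [0, n_colors); tau: coherence threshold in pixels."""
    alpha = np.zeros(n_colors, dtype=int)   # coherent pixel counts
    beta = np.zeros(n_colors, dtype=int)    # incoherent pixel counts
    for c in range(n_colors):
        labels, n_regions = label(indexed == c)
        for region in range(1, n_regions + 1):
            size = int(np.sum(labels == region))
            if size >= tau:
                alpha[c] += size
            else:
                beta[c] += size
    return alpha, beta

def similarity(ccv1, ccv2, n_pixels):
    """Equations (2.19)-(2.20): 1 minus the normalized sum of count differences."""
    a1, b1 = ccv1
    a2, b2 = ccv2
    diff = np.sum(np.abs(a1 - a2) + np.abs(b1 - b2))
    return 1.0 - diff / (2.0 * n_pixels)

img = np.random.randint(0, 8, size=(100, 100))        # stand-in for an indexed image
ccv = color_coherence_vector(img, n_colors=8, tau=100)
print(similarity(ccv, ccv, img.size))                 # identical images give similarity 1.0
```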

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient continue to the final step of the evaluation part. The final step is a retrieval step where one image is selected from each cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.

Figure 3.1 Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient are sent to the retrieval part. In the retrieval part a selection is made from sets of images that are similar, so that only one is retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.
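To make the flow in figure 3.1 concrete, the sketch below outlines the evaluation-side logic in Python; every function passed in (quality_model, content_model, cluster_by_similarity) and the label strings are placeholders standing in for the corresponding parts of the system, not code from the thesis.

```python
# Schematic sketch of the evaluation flow in figure 3.1 with placeholder callables.
def select_images(images, quality_model, content_model, cluster_by_similarity):
    scored = []                                    # (image, combined score) for images kept
    for img in images:
        q_label, q_score = quality_model(img)      # "good"/"bad" plus certainty score
        c_label, c_score = content_model(img)      # "salient"/"non-salient" plus score
        if q_label == "good" and c_label == "salient":
            scored.append((img, q_score + c_score))
    selected = []
    for cluster in cluster_by_similarity(scored):  # clusters of (image, score) pairs
        best_img, _ = max(cluster, key=lambda pair: pair[1])
        selected.append(best_img)                  # one representative per cluster
    return selected
```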

31 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification that are compared against each other. The best feature extraction method for each of the two classifications is used for that part, and the entire system is put together. The methods used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21], and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task so as to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

32 Predictor

The predictor used is an SVM, as described in section 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and it results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in the color coherence vectors of the images, according to [17]. The difference in the color coherence vectors of two images consists of the difference in the number of coherent pixels and the number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of a 500 × 500 pixel image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
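A minimal sketch of this clustering rule is given below, assuming a precomputed pairwise similarity matrix S; the threshold corresponds to the 87% used above, and the small example matrix is illustrative only.

```python
# Sketch of the clustering rule: an image joins an existing cluster if its average
# pairwise similarity to the cluster members exceeds the threshold, otherwise it
# starts a new cluster. S is a precomputed pairwise similarity matrix.
import numpy as np

def cluster_by_average_similarity(S, threshold=0.87):
    clusters = []                                  # each cluster is a list of image indices
    for i in range(S.shape[0]):
        placed = False
        for cluster in clusters:
            # S may be stored upper-triangular, so take whichever entry is filled in
            avg = np.mean([max(S[i, j], S[j, i]) for j in cluster])
            if avg > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

S = np.array([[1.0, 0.91, 0.32],
              [0.0, 1.00, 0.31],
              [0.0, 0.00, 1.00]])                  # toy upper-triangular similarity matrix
print(cluster_by_average_similarity(S))            # -> [[0, 1], [2]]
```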


34 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well it conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives)     (3.1)

which describes how many of the retrieved images should have been retrieved,

Recall = true positives / (true positives + false negatives)     (3.2)

which describes how many of the images that should be retrieved are retrieved, and

Accuracy = (true positives + true negatives) / all samples     (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy

Figure 3.2 An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can also be either true or false. The different concepts are illustrated in (a), and how they define the measures is illustrated in (b), (c) and (d).
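The three measures follow directly from the counts of true/false positives and negatives; a minimal helper implementing equations (3.1)-(3.3) is sketched below, with toy counts used purely as an example.

```python
# Direct implementation of equations (3.1)-(3.3) from the confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, n_samples):
    return (tp + tn) / n_samples

tp, fp, fn, tn = 40, 10, 5, 45                       # toy counts
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, tp + fp + fn + tn))
```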

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images, such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations; to one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and it considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.


(a) Original image  (b) Brightened and Gaussian blurred

(c) Motion blurred  (d) Darkened and added salt-and-pepper noise

Figure 3.3 An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.
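A sketch of how such a good/bad label can be produced is shown below: an image is randomly degraded and compared to the original with SSIM from scikit-image, using the 0.65 threshold. The particular degradations and their parameters here are illustrative, not the exact ones used to build the data set.

```python
# Sketch of the good/bad labelling: degrade an image, compare to the original with
# SSIM, and label by the 0.65 threshold. Degradation choices are illustrative.
import numpy as np
from skimage import data, img_as_float
from skimage.filters import gaussian
from skimage.util import random_noise
from skimage.metrics import structural_similarity as ssim

original = img_as_float(data.camera())                 # grayscale example image
rng = np.random.default_rng(0)

degraded = original.copy()
if rng.random() < 0.5:                                  # "light" group: brighten or darken
    degraded = np.clip(degraded + rng.uniform(-0.4, 0.4), 0, 1)
if rng.random() < 0.5:                                  # "noise/blur" group
    degraded = random_noise(degraded, mode="s&p", amount=0.05)
else:
    degraded = gaussian(degraded, sigma=2)

score = ssim(original, degraded, data_range=1.0)
label = "good" if score > 0.65 else "bad"
print(score, label)
```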

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000 but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of its images, and the quality evaluation set consists of the content evaluation set and modified versions of its images. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. This is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and to belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4 Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which is equal to the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 41 The proportion of good images for the different salient objects

Proportion of good images    Salient object
0.6951                       cat
0.7288                       airplane
0.6935                       umbrella
0.6821                       handbag
0.6902                       motorbike



Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method         Precision   Recall   Accuracy   Salient object
HOG                               0.8399      0.939    0.8332     cat
HOG                               0.8544      0.9799   0.8636     airplane
HOG                               0.8018      0.9702   0.813      umbrella
HOG                               0.8333      0.9442   0.8332     handbag
HOG                               0.8506      0.9236   0.8353     motorbike
HOG                               0.8360      0.9514   0.8357     average
Extracted from the DCT domain     0.9196      0.9116   0.8832     cat
Extracted from the DCT domain     0.9292      0.9500   0.9109     airplane
Extracted from the DCT domain     0.9348      0.9444   0.9158     umbrella
Extracted from the DCT domain     0.9348      0.9251   0.9049     handbag
Extracted from the DCT domain     0.9308      0.9425   0.9120     motorbike
Extracted from the DCT domain     0.9298      0.9347   0.9054     average
Features extracted from a CNN     0.6951      1        0.6951     cat
Features extracted from a CNN     0.7288      1        0.7288     airplane
Features extracted from a CNN     0.6935      1        0.6935     umbrella
Features extracted from a CNN     0.6821      1        0.6821     handbag
Features extracted from a CNN     0.6902      1        0.6902     motorbike
Features extracted from a CNN     0.6979      1        0.6979     average


(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


42 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. Guessing by placing all images in one class would lead to an accuracy of 50%, and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method         Precision   Recall   Accuracy   Salient object
HOG                               0.6631      0.6717   0.6652     cat
HOG                               0.8645      0.8043   0.8391     airplane
HOG                               0.5959      0.5739   0.5924     umbrella
HOG                               0.6759      0.6348   0.6652     handbag
HOG                               0.5758      0.7348   0.5967     motorbike
HOG                               0.6750      0.6839   0.6717     average
Extracted from the DCT domain     0.6253      0.6239   0.6250     cat
Extracted from the DCT domain     0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain     0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain     0.6256      0.5630   0.613      handbag
Extracted from the DCT domain     0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain     0.6559      0.6370   0.6441     average
Features extracted from a CNN     0.9038      0.7761   0.8467     cat
Features extracted from a CNN     1           0.6935   0.8467     airplane
Features extracted from a CNN     0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN     0.7560      0.6804   0.7304     handbag
Features extracted from a CNN     0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN     0.8799      0.7635   0.8256     average


(a) HOG features. (b) Features extracted from the DCT domain. (c) Features extracted from a CNN.

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN give the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow, in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities of 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a. (b) Color coherence vector of image 4.3b. (c) Color coherence vector of image 4.3c.

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axes are the indexed colors and the y-axes are the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix. (b) Binary similarity matrix showing images that originate from the same image.

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, which is the feature extraction method that provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, which is the feature extraction method that provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results, with an average accuracy of 90.54% compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being exactly 0.5, and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained to be able to provide separating features. For HOG, the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small, intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with a slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in the proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which correspond to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
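As an illustration of how such a threshold sweep could be carried out, the following minimal sketch computes ROC points from SVM decision scores. It assumes Python with scikit-learn rather than the MATLAB implementation used in the thesis, and the score and label values are hypothetical.

import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical SVM decision scores and ground-truth labels (1 = good, 0 = bad)
scores = np.array([1.3, 0.2, -0.4, 0.8, -1.1, 0.05])
labels = np.array([1, 1, 0, 1, 0, 0])

# Each threshold gives one (false positive rate, true positive rate) point on the ROC curve
fpr, tpr, thresholds = roc_curve(labels, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: TPR {t:.2f}, FPR {f:.2f}")

# A more tolerant (lower) threshold keeps more good images (higher TPR)
# at the price of letting more bad images through (higher FPR).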

5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results, with an average accuracy of 82.56% compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances of 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line, for this case where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG features are often used for content classification tasks, and perform well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that, in the similarity retrieval part, there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.

5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such, it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes, in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with apache spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


$$\begin{bmatrix} \text{class(observation 1)} \\ \text{class(observation 2)} \\ \vdots \end{bmatrix} = f\left( \begin{bmatrix} \text{observation 1} \\ \text{observation 2} \\ \vdots \end{bmatrix} \right) \qquad (2.1)$$

Y is a column vector where each row contains the class of the corresponding row in X. Each row in X corresponds to an observation, which is represented by the values, also called features, in its columns. These values can be measurements such as weight and height, but when it comes to images the compilation of the values in X becomes more complex [14]. Raw pixel values can be used as features for images, but for other than simple cases the representation is not descriptive enough, especially when working with natural images. The aim is to represent an image by distinctive attributes that separate the observations of one class from the other. Therefore, an important step when using machine learning on images is feature extraction [7]. In figure 2.2, the feature extraction is a big part of the first step in both the training part and the evaluation part. There are many methods for feature extraction; this thesis covers three of them: histogram of oriented gradients in section 2.4, features extracted from the discrete cosine domain in section 2.5 and features extracted from a pre-trained convolutional neural network in section 2.6.

2.3 Support Vector Machines

Support vector machines (SVM) are a form of supervised machine learning model. By learning from provided examples (the training data), the model finds a function that couples input data to the correct output. The output for novel data can then be predicted by applying the retrieved function. SVM is often used for classification problems, for which the correct output is the class the data belongs to. The model works by creating a hyperplane that separates the data points of one class from those of the other class, with a margin as high as possible. The margin is the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors, which give the model its name, are the data points closest to the hyperplane and therefore determine the margin. The margin and the support vectors are illustrated in figure 2.3.


Figure 2.3: Illustration of the hyperplane separating data points from two classes, shown as + and -. The support vectors and the margin are marked. Figure drawn according to [11].

The data might not allow for a separating hyperplane; in that case a soft margin can be used, which means that the hyperplane separates many, but not all, data points. The data for training is a set of vectors $x_j$ along with their classes $y_j$, where $j$ is a training instance, $j = 1, 2, \ldots, l$, and $l$ is the number of training instances. The hyperplane can be created in a higher dimensional space if separating the classes requires it. The hyperplane is described by $w^T\phi(x_j) + w_0 = 0$, where $\phi$ is a function that maps $x_j$ to a higher-dimensional space and $w$ is the normal to the hyperplane. The SVM classifier satisfies the following conditions:

$$\begin{cases} w^T\phi(x_j) + w_0 \geq +1 & \text{if } y_j = +1 \\ w^T\phi(x_j) + w_0 \leq -1 & \text{if } y_j = -1 \end{cases} \qquad j = 1, 2, \ldots, l \qquad (2.2)$$

and classifies according to the following decision function:

$$y(x) = \mathrm{sign}\left[ w^T\phi(x) + w_0 \right] \qquad (2.3)$$

where $\phi$ non-linearly maps $x$ to the high-dimensional feature space. A linear separation is then performed in the feature space, which is illustrated in figure 2.4.

Figure 2.4: Illustration of the non-linear mapping of $\phi$ from the input space to the high-dimensional feature space. The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space, but the resulting feature space can be of higher dimensions. In both spaces the data points of different classes, shown as + and -, are on different sides of the hyperplane, but in the high-dimensional space they are linearly separable. Figure drawn according to [2].

If the feature space is high-dimensional, performing computations in that space is computationally heavy. Therefore a kernel function is introduced, which is used to map the original non-linear observations into the higher dimensional space more efficiently. The kernel function can be expressed as a dot product in a high-dimensional space. Through the kernel function, all computations are performed in the low-dimensional input space. The kernel function is

$$K(x, x') = \phi(x)^T\phi(x') \qquad (2.4)$$

which is equal to the inner product of the two vectors $x$ and $x'$ in the feature space. Using kernels, a new non-linear decision function is retrieved:

$$y(x) = \mathrm{sign}\left[ \sum_{j=1}^{l} y_j K(x, x_j) + w_0 \right] \qquad (2.5)$$

which corresponds to the form of the hyperplane in the input space [2], [11].
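A minimal sketch of the idea, assuming Python with scikit-learn rather than the MATLAB implementation used in the thesis: the RBF kernel performs the implicit mapping φ, and the fitted model exposes the support vectors and the signed distance to the hyperplane. The synthetic data is purely illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic two-class data that is not linearly separable in the input space
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # class depends on radius

# The RBF kernel corresponds to an implicit mapping into a high-dimensional
# space; all computations stay in the input space through K(x, x').
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("predicted classes:", clf.predict(X[:5]))
print("signed distances to hyperplane:", clf.decision_function(X[:5]))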

2.4 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method for machine learning implementations for object detection. It works by describing an image as a set of local histograms, which in turn represent occurrences of gradient orientations in a local part of the image. The image is divided into blocks with 50% overlap; each block is in turn divided into cells. Due to the overlap of the blocks, one cell can be present in more than one block. For each pixel in each cell, the gradients in the x and y directions ($G_x$ and $G_y$) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

(a) Original image. (b) Gradient in the x direction, $G_x$. (c) Gradient in the y direction, $G_y$.

Figure 2.5: An image and its gradient representations in the x and y directions.

The magnitude and phase of the gradients are then calculated according to

$$r = \sqrt{G_x^2 + G_y^2} \qquad (2.6)$$

$$\theta = \arctan\left(\frac{G_y}{G_x}\right) \qquad (2.7)$$

For each cell, a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0°-180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20], [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

(a) Image with rose plots. (b) Zoomed in.

Figure 2.6: The histograms of each cell in the image are visualized using rose plots. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0°-180°, which makes the rose plots symmetric [12].
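A small sketch of HOG extraction, assuming scikit-image in Python; the cell and block sizes are illustrative and not the settings used in the thesis, and the example image is scikit-image's built-in test image.

from skimage.feature import hog
from skimage import data, color

# Example image; the thesis data set is different
image = color.rgb2gray(data.astronaut())

# Gradient histograms are computed per cell, normalized per block
# (blocks slide by one cell, giving the overlap), and concatenated.
features, hog_image = hog(
    image,
    orientations=9,            # number of histogram bins between 0 and 180 degrees
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
    visualize=True,
)
print("HOG feature vector length:", features.shape[0])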

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size $M \times N$ in the discrete cosine domain is done by transforming the image pixel values according to

$$B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn} \cos\left(\frac{\pi(2m+1)p}{2M}\right) \cos\left(\frac{\pi(2n+1)q}{2N}\right) \qquad (2.8)$$

where $0 \leq p \leq M-1$, $0 \leq q \leq N-1$,

$$\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \leq p \leq M-1 \end{cases} \qquad (2.9)$$

and

$$\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \leq q \leq N-1 \end{cases} \qquad (2.10)$$

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
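A minimal sketch of the block transform in equation (2.8), assuming SciPy's separable DCT-II with orthonormal scaling; the 8 x 8 block of random values is illustrative.

import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # Separable 2-D DCT-II with orthonormal scaling, matching equation (2.8)
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

rng = np.random.default_rng(0)
block = rng.random((8, 8))          # an 8x8 image patch with values in [0, 1]
coeffs = dct2(block)

# Most of the energy ends up in a few low-frequency coefficients (top-left corner)
energy_total = np.sum(coeffs ** 2)
energy_lowfreq = np.sum(coeffs[:4, :4] ** 2)
print(f"share of energy in low frequencies: {energy_lowfreq / energy_total:.2f}")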

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized $N \times N$ blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of DCT image coefficients:

$$f(x|\alpha, \beta, \gamma) = \alpha \exp\left(-(\beta|x - \mu|)^\gamma\right) \qquad (2.11)$$

where $x$ is the multivariate random variable, $\mu$ is the mean, $\gamma$ is the shape parameter, and $\alpha$ and $\beta$ are the normalizing and scale parameters given by

$$\alpha = \frac{\beta\gamma}{2\Gamma(1/\gamma)} \qquad (2.12)$$

$$\beta = \frac{1}{\sigma}\sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}} \qquad (2.13)$$

where σ is the standard deviation and Γ is the gamma function given by

$$\Gamma(z) = \int_0^\infty t^{z-1} \exp(-t)\, dt \qquad (2.14)$$

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.

(a) A 5 × 5 block in an image, on which the parameters γ and ζ are calculated. (b) A 5 × 5 block split into radial frequency sub-bands a, on which $R_a$ is calculated. (c) A 5 × 5 block split into oriented sub-bands b, on which $\zeta_b$ is calculated.

Figure 2.7: Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8: Generalized Gaussian distribution for different values of γ.

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ:

$$\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\Gamma(3/\gamma)}{\Gamma^2(2/\gamma)} - 1} \qquad (2.15)$$

where X is a random variable representing the histogrammed DCT coefficients, $\sigma_{|X|}$ and $\mu_{|X|}$ are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

$$E_a = \sigma_a^2 \qquad (2.16)$$

The average energy up to band n is described by

$$E_{j<n} = \frac{1}{n-1}\sum_{j<n} E_j \qquad (2.17)$$

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the $\sigma_a^2$ from the fit. Using the two parameters $E_a$ and $E_{j<a}$, a ratio $R_a$ is formed between the components and the sum of the components, according to

$$R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (2.18)$$

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3, and the mean of the two resulting ratios $R_1$ and $R_2$ is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of $R_a$ and the mean of $R_a$.

• The orientation model-based feature ζ, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions, to capture directional information. $\zeta_b$ is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting $\zeta_b$ from all the blocks in an image is calculated. $\zeta_b$ and the variance of $\zeta_b$ are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the 10th highest percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is briefed on fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
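A sketch of the multi-scale loop described above, assuming SciPy for the Gaussian low-pass filter; the per-block feature extraction itself is abstracted into a placeholder function, and the σ value of the filter is illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

def extract_block_features(image):
    # Placeholder for the eight DCT-domain features described above
    return np.zeros(8)

def multiscale_features(image, scales=3, sigma=1.0):
    features = []
    current = image.astype(float)
    for _ in range(scales):
        features.append(extract_block_features(current))
        # Low-pass filter with a symmetric Gaussian kernel, then subsample by a factor of 2
        current = gaussian_filter(current, sigma=sigma)[::2, ::2]
    return np.concatenate(features)   # 8 features x 3 scales = 24 values per image

image = np.random.default_rng(0).random((256, 256))
print(multiscale_features(image).shape)   # (24,)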

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

Convolutional neural networks (CNN) are a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenge of recognizing objects of a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.

Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data, and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value, which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, shown in figure 2.10.

Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers, and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image, the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer, the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers, the output is passed on to the fully connected layers. In the connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results, outperforming many existing methods and clustering with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images, used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
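A sketch of the same idea, taking the 1000 soft-max values of a pre-trained network as a descriptor. The thesis uses MatConvNet's VGG-F; the snippet below assumes Python with PyTorch/torchvision (a recent version with the weights enum API) and substitutes AlexNet purely for illustration, since VGG-F is not available there, and the image path is hypothetical.

import torch
from torchvision import models, transforms
from PIL import Image

# AlexNet stands in for VGG-F here; both end in 1000 ImageNet class activations.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # hypothetical image path
x = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(x)                       # 1000 class activations
    descriptor = torch.softmax(logits, 1)   # soft-max values used as the feature vector
print(descriptor.shape)   # torch.Size([1, 1000])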

2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

$$\langle (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) \rangle, \quad j = 1, 2, \ldots, n$$

where $\alpha_j$ is the number of coherent pixels and $\beta_j$ is the number of incoherent pixels for color $j$, and $n$ is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I' is then given by the following parameters:

$$\text{differentiating pixels} = \sum_{j=1}^{n} |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \qquad (2.19)$$

$$\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (2.20)$$

[17]
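A minimal sketch of computing a color coherence vector and the similarity in equations (2.19)-(2.20), assuming Python with SciPy for connected-component labelling; the number of colors and the coherence threshold below are illustrative, not the values used in the thesis.

import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors, tau):
    """Return (alpha, beta): coherent / incoherent pixel counts per color."""
    alpha = np.zeros(n_colors, dtype=int)
    beta = np.zeros(n_colors, dtype=int)
    for c in range(n_colors):
        labelled, n_regions = label(indexed == c)    # connected regions of color c
        for r in range(1, n_regions + 1):
            size = int(np.sum(labelled == r))
            if size > tau:
                alpha[c] += size                      # coherent region
            else:
                beta[c] += size                       # incoherent region
    return alpha, beta

def similarity(ccv1, ccv2, n_pixels):
    a1, b1 = ccv1
    a2, b2 = ccv2
    diff = np.sum(np.abs(a1 - a2) + np.abs(b1 - b2))  # equation (2.19)
    return 1 - diff / (n_pixels * 2)                  # equation (2.20)

rng = np.random.default_rng(0)
img = rng.integers(0, 8, size=(50, 50))   # toy indexed image with 8 colors
ccv = color_coherence_vector(img, n_colors=8, tau=25)
print(similarity(ccv, ccv, img.size))     # 1.0 for identical images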

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.


Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part, a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part when the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task, in order to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].
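A sketch of the training and scoring step, assuming Python with scikit-learn instead of the MATLAB implementation; note that scikit-learn returns a single signed score per image for binary problems, whereas the score matrix described above has one column per class. The feature vectors below are synthetic.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Hypothetical feature vectors: 100 good-quality and 100 bad-quality training images
X_train = np.vstack([rng.normal(0, 1, (100, 24)), rng.normal(2, 1, (100, 24))])
y_train = np.array([1] * 100 + [0] * 100)        # 1 = good, 0 = bad

model = SVC(kernel="rbf", gamma="scale")
model.fit(X_train, y_train)

X_new = rng.normal(1, 1, (5, 24))                # new images to classify
labels = model.predict(X_new)                    # class label per image
scores = model.decision_function(X_new)          # certainty score per image
print(labels, scores)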

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in the number of coherent pixels and the number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster, only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
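A sketch of the clustering and retrieval logic described above; the greedy, order-dependent assignment is an assumption about the exact procedure, and the similarity matrix and scores are toy values.

import numpy as np

def cluster_by_similarity(sim, threshold=0.87):
    """Place an image in a cluster if its average similarity to the cluster exceeds the threshold.
    sim is a symmetric matrix of pairwise similarities in [0, 1]."""
    clusters = []
    for i in range(sim.shape[0]):
        placed = False
        for cluster in clusters:
            if np.mean([sim[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

def retrieve(clusters, scores):
    # From each cluster keep the image with the highest combined classification score
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]

sim = np.array([[1.00, 0.91, 0.30],
                [0.91, 1.00, 0.32],
                [0.30, 0.32, 1.00]])
scores = np.array([1.2, 0.8, 1.5])
print(retrieve(cluster_by_similarity(sim), scores))   # one image per cluster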

3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false positive/negative.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be retrieved. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way true/false positives/negatives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

    Precision = true positives / (true positives + false positives),        (3.1)

which describes how many of the retrieved images should have been retrieved,

    Recall = true positives / (true positives + false negatives),        (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

    Accuracy = (true positives + true negatives) / all samples,        (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false positives/negatives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images   (b) Precision   (c) Recall   (d) Accuracy

Figure 3.2: An illustration of the concept used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
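Computed directly from the counts of true/false positives/negatives, the three measures are one-liners; a minimal MATLAB sketch, with tp, fp, tn and fn as the counts, is:

precision = tp / (tp + fp);                     % eq. (3.1)
recall    = tp / (tp + fn);                     % eq. (3.2)
accuracy  = (tp + tn) / (tp + fp + tn + fn);    % eq. (3.3)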

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM ≈ 0.65 are shown in figure 3.3.
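The degradation and labelling step could be sketched as below; the degradation strengths are illustrative, since the exact parameter ranges are not given in the text, and the Image Processing Toolbox ssim function is assumed.

function [J, label] = degradeAndLabel(I)
    % Apply at most one alteration from each group (light, noise/blur) with
    % randomized strength, then label good/bad from the SSIM to the original.
    J = I;
    switch randi(3)                                      % light group
        case 2, J = J + uint8(randi([20 80]));           % brighten
        case 3, J = J - uint8(randi([20 80]));           % darken
    end
    switch randi(5)                                      % noise/blur group
        case 2, J = imnoise(J, 'salt & pepper', 0.05 * rand);
        case 3, J = imnoise(J, 'gaussian', 0, 0.02 * rand);
        case 4, J = imgaussfilt(J, 0.5 + 3 * rand);      % Gaussian blur
        case 5, J = imfilter(J, fspecial('motion', randi([5 25]), 360 * rand));
    end
    if ssim(rgb2gray(J), rgb2gray(I)) > 0.65
        label = 'good';
    else
        label = 'bad';
    end
end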


(a) Original image   (b) Brightened and Gaussian blurred   (c) Motion blurred   (d) Darkened and added salt-and-pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
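A sketch of how such near-duplicates could be generated is given below; the particular transformations and their magnitudes are illustrative choices, not taken from the thesis.

% Create rigid-transform variants of an image; the salient object stays in view.
I = imresize(I0, [500 500]);                 % all images are 500 x 500 pixels
variants = {
    imrotate(I, 5, 'crop')                   % small rotation, same size
    flip(I, 2)                               % horizontal mirror
    imtranslate(I, [15 -10])                 % small translation
};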

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method          Precision  Recall  Accuracy  Salient object
HOG                                0.8399     0.939   0.8332    cat
HOG                                0.8544     0.9799  0.8636    airplane
HOG                                0.8018     0.9702  0.813     umbrella
HOG                                0.8333     0.9442  0.8332    handbag
HOG                                0.8506     0.9236  0.8353    motorbike
HOG                                0.8360     0.9514  0.8357    average
Extracted from the DCT domain      0.9196     0.9116  0.8832    cat
Extracted from the DCT domain      0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain      0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain      0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain      0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain      0.9298     0.9347  0.9054    average
Features extracted from a CNN      0.6951     1       0.6951    cat
Features extracted from a CNN      0.7288     1       0.7288    airplane
Features extracted from a CNN      0.6935     1       0.6935    umbrella
Features extracted from a CNN      0.6821     1       0.6821    handbag
Features extracted from a CNN      0.6902     1       0.6902    motorbike
Features extracted from a CNN      0.6979     1       0.6979    average


(a) HOG features   (b) Features extracted from the DCT domain   (c) Features extracted from a CNN

Figure 4.1: ROC curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50 % and one of the values for precision or recall to 100 % and the other to 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method          Precision  Recall  Accuracy  Salient object
HOG                                0.6631     0.6717  0.6652    cat
HOG                                0.8645     0.8043  0.8391    airplane
HOG                                0.5959     0.5739  0.5924    umbrella
HOG                                0.6759     0.6348  0.6652    handbag
HOG                                0.5758     0.7348  0.5967    motorbike
HOG                                0.6750     0.6839  0.6717    average
Extracted from the DCT domain      0.6253     0.6239  0.6250    cat
Extracted from the DCT domain      0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain      0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain      0.6256     0.5630  0.613     handbag
Extracted from the DCT domain      0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain      0.6559     0.6370  0.6441    average
Features extracted from a CNN      0.9038     0.7761  0.8467    cat
Features extracted from a CNN      1          0.6935  0.8467    airplane
Features extracted from a CNN      0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN      0.7560     0.6804  0.7304    handbag
Features extracted from a CNN      0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN      0.8799     0.7635  0.8256    average


(a) HOG features   (b) Features extracted from the DCT domain   (c) Features extracted from a CNN

Figure 4.2: ROC curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 4.3a   (b) Color coherence vector of image 4.3b   (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels on a logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix   (b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations (3.1)-(3.3) that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b. Ideal results are shown in a ROC curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with a slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in the proportion of good images between the different salient objects however causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which represent two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, which describe statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist of mostly context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations (3.1)-(3.3), affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of those for the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more source of error when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are counted as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences, if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

6 2 Related theory

Figure 23 Illustration of the hyperplane separating data points from two classesshown as + and - The support vectors and the margin are marked Figure drawnaccording to [11]

The data might not allow for a separating hyperplane in that case a soft margin canbe used which means that the hyperplane separates many but not all data points Thedata for training is a set of vectors xj along with their classes yj where j is a traininginstance j = 1 2 l and l is the number of training instances The hyperplane can becreated in a higher dimensional space if separating the classes requires it The hyperplaneis described by wTϕ(xj ) + w0 = 0 where ϕ is a function that maps xj to a higher-dimensional space and w is the normal to the hyperplane The SVM classifier satisfies thefollowing conditions

wTϕ(xj ) + w0 ge +1 if yj = +1wTϕ(xj ) + w0 le minus1 if yj = minus1 j = 1 2 l

(22)

and classifies according to the following decision function

y(x) = sign[wTϕ(xj ) + w0

] (23)

where ϕ non-linearly maps x to the high-dimensional feature space A linear separationis then performed in the feature space which is illustrated in 24

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks, trained on large and general tasks, has been shown to produce useful results which outperform many existing methods, and to cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
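A minimal sketch of this kind of descriptor extraction is given below. The thesis uses MatConvNet's pre-trained VGG-F model in MATLAB; torchvision's pre-trained VGG-16 is used here only as an assumed stand-in to illustrate reusing a network trained on the 1000-class task, and the function name is illustrative.

```python
# Sketch: extract a 1000-dimensional soft-max descriptor from a pre-trained CNN.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(pretrained=True)   # assumed stand-in for VGG-F
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_features(path):
    """Return the soft-max vector of the final layer, used as an image descriptor."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)
        return torch.softmax(logits, dim=1).squeeze(0).numpy()
```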


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

$$\langle (\alpha_1, \beta_1), \dots, (\alpha_n, \beta_n) \rangle, \qquad j = 1, 2, \dots, n$$

where $\alpha_j$ is the number of coherent pixels, $\beta_j$ is the number of incoherent pixels for color $j$, and $n$ is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images $I$ and $I'$ is then given by the following parameters:

$$\text{differentiating pixels} = \sum_{j=1}^{n} \left( |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \right) \tag{2.19}$$

$$\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \tag{2.20}$$

[17]
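As an illustration, the sketch below computes color coherence vectors for indexed images and evaluates equations (2.19)-(2.20). Function names and the small default threshold are assumptions for this example; the thesis later uses a threshold of 2500 pixels on 500 × 500 images.

```python
# Sketch of the color coherence vector (CCV) comparison described above.
import numpy as np
from scipy import ndimage

def color_coherence_vector(indexed_img, n_colors, tau):
    """Return (alpha, beta): coherent/incoherent pixel counts per indexed color."""
    alpha = np.zeros(n_colors, dtype=np.int64)
    beta = np.zeros(n_colors, dtype=np.int64)
    for c in range(n_colors):
        labels, n_regions = ndimage.label(indexed_img == c)
        sizes = np.bincount(labels.ravel())[1:]      # region sizes, background excluded
        alpha[c] = sizes[sizes > tau].sum()          # coherent: large contiguous regions
        beta[c] = sizes[sizes <= tau].sum()          # incoherent: small regions
    return alpha, beta

def ccv_similarity(ccv1, ccv2, n_pixels):
    """Similarity according to equations (2.19)-(2.20)."""
    (a1, b1), (a2, b2) = ccv1, ccv2
    differentiating = np.sum(np.abs(a1 - a2) + np.abs(b1 - b2))
    return 1 - differentiating / (2 * n_pixels)
```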

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient continue to the final step in the evaluation part. The final step is a retrieval step where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.




Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient are sent to the retrieval part. In the retrieval part a selection is made from sets of images that are similar, so that only one is retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part, and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.
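The sketch below outlines two of the three extractors using common Python libraries as assumed stand-ins for the MATLAB implementations referenced in the text (the CNN descriptor was sketched in section 2.6.2). The DCT part is reduced to a toy per-block coefficient summary, not the full feature set of section 2.5.

```python
# Sketch of HOG and a heavily reduced DCT-domain descriptor on grayscale images.
import numpy as np
from skimage.feature import hog
from scipy.fft import dctn

def hog_features(gray_img):
    """Histogram of oriented gradients feature vector (parameters are assumptions)."""
    return hog(gray_img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def dct_block_stats(gray_img, block=5):
    """Toy DCT-domain descriptor: mean per-block coefficient statistics."""
    h, w = gray_img.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            coeffs = dctn(gray_img[y:y + block, x:x + block], norm="ortho")
            feats.append([coeffs.std(), np.abs(coeffs).mean()])
    return np.array(feats).mean(axis=0)
```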

3.2 Predictor

The predictor used is an SVM, as described in chapter 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
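A minimal sketch of the predictor step follows. The thesis uses MATLAB's SVM implementation; scikit-learn is used here as an assumed equivalent, and the kernel choice is an assumption. X_train/X_eval are feature matrices (HOG, DCT-domain or CNN descriptors) and y_train holds binary labels.

```python
# Sketch: train a binary SVM and score new images.
from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """Train a binary SVM classification model."""
    model = SVC(kernel="rbf")            # kernel choice is an assumption
    model.fit(X_train, y_train)
    return model

def classify(model, X_eval):
    """Return predicted labels and one certainty score per image."""
    labels = model.predict(X_eval)
    scores = model.decision_function(X_eval)   # signed distance to the hyperplane
    return labels, scores
```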

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1.0% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
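The following is a sketch of the greedy clustering and retrieval described above. The order in which images are considered and the handling of ties are assumptions not specified in the text.

```python
# Sketch of similarity-based clustering and per-cluster retrieval.
import numpy as np

def cluster_by_similarity(sim, threshold=0.87):
    """sim: symmetric matrix of pairwise similarities in [0, 1].
    Returns a list of clusters, each a list of image indices."""
    clusters = []
    for i in range(sim.shape[0]):
        placed = False
        for cluster in clusters:
            # average similarity between image i and the images already in the cluster
            if np.mean([sim[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

def retrieve(clusters, scores):
    """Keep the image with the highest combined classification score per cluster."""
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]
```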


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval are evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false positive/negative.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well it conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{3.1}$$

which describes how many of the retrieved images should have been retrieved,

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{3.2}$$

which describes how many of the images that should be retrieved are actually retrieved, and

$$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \tag{3.3}$$

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images (b) Precision (c) Recall (d) Accuracy

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are denoted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
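A small helper computing the measures in equations (3.1)-(3.3) from boolean arrays of predicted and true selections; a minimal sketch for illustration.

```python
# Sketch: precision, recall and accuracy from predicted/actual boolean selections.
import numpy as np

def precision_recall_accuracy(predicted, actual):
    predicted, actual = np.asarray(predicted, bool), np.asarray(actual, bool)
    tp = np.sum(predicted & actual)
    fp = np.sum(predicted & ~actual)
    fn = np.sum(~predicted & actual)
    tn = np.sum(~predicted & ~actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / predicted.size
    return precision, recall, accuracy
```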

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.


(a) Original image (b) Brightened and Gaussian blurred (c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.
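The sketch below illustrates the random degradation and SSIM-based labelling described above, assuming a color image with values in [0, 1]. The parameter ranges are assumptions, and motion blur is omitted from the sketch for brevity.

```python
# Sketch of random quality degradation and SSIM-based good/bad labelling.
import numpy as np
from skimage import util, filters
from skimage.metrics import structural_similarity

def degrade(img, rng):
    """Apply at most one 'light' and one 'noise/blur' alteration."""
    out = img.copy()
    if rng.random() < 0.5:                                    # light group
        gain = rng.uniform(0.4, 1.6)                          # <1 darkens, >1 brightens
        out = np.clip(out * gain, 0, 1)
    if rng.random() < 0.5:                                    # noise/blur group
        choice = rng.choice(["sp", "gauss_noise", "gauss_blur"])
        if choice == "sp":
            out = util.random_noise(out, mode="s&p", amount=rng.uniform(0.01, 0.1))
        elif choice == "gauss_noise":
            out = util.random_noise(out, mode="gaussian", var=rng.uniform(0.001, 0.02))
        else:
            out = filters.gaussian(out, sigma=rng.uniform(1, 4), channel_axis=-1)
    return out

def quality_label(original, degraded):
    """Label an image as good (True) if SSIM > 0.65, otherwise bad (False)."""
    ssim = structural_similarity(original, degraded, channel_axis=-1, data_range=1.0)
    return ssim > 0.65
```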

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
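A sketch of generating near-duplicate versions with rigid transformations is shown below; the rotation and translation ranges and the number of versions are assumptions for illustration.

```python
# Sketch: create slightly rotated/shifted versions of an image, resized to 500x500.
import numpy as np
from skimage.transform import rotate, resize

def similar_versions(img, rng, n_versions=3):
    versions = []
    for _ in range(n_versions):
        shift = (int(rng.integers(-20, 20)), int(rng.integers(-20, 20)))
        shifted = np.roll(img, shift=shift, axis=(0, 1))       # small translation
        rotated = rotate(shifted, rng.uniform(-10, 10), mode="edge")
        out_shape = (500, 500) + rotated.shape[2:]             # keep color channels
        versions.append(resize(rotated, out_shape, anti_aliasing=True))
    return versions
```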

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value the same as the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images   Salient object
0.6951                      cat
0.7288                      airplane
0.6935                      umbrella
0.6821                      handbag
0.6902                      motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.1: ROC curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.2: ROC curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figure 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.



Figure 4.3: Examples of images that are clustered as similar and an image that is not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a (b) Color coherence vector of image 4.3b (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis is the indexed colors and the y-axis is the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similarities as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measure between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. This is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. This is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained for it to be able to provide separating features. For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore, it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b. Ideal results are shown in a ROC curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high, which implies that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks, performing well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part, and all of them for the similarity retrieval part, separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such, it is not often the case that one method always outperforms the others; instead, it varies much with data sets and tasks. Therefore, the biggest remark on the methods chosen is the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence matrix can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but holds room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning. May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 14: Feature extraction for image selection using machine learning

24 Histogram of oriented gradients 7

Figure 24 Illustration of the non-linear mapping of ϕ from the input space to thehigh-dimension feature space The figure shows an example which maps from a 2-dimensional input space to a 3-dimensional feature space but the resulting featurespace can be of higher dimensions In both spaces the data points of different classesshown as + and - are on different sides of the hyperplane but in the high-dimensionalspace they are linearly separable Figure drawn according to [2]

If the feature space is high-dimensional performing computations in that space iscomputationally heavy Therefore a kernel function is introduced which is used to mapthe original non-linear observations into higher dimensional space more efficiently Thekernel function can be expressed as a dot product in a high-dimensional space Throughthe kernel function all computations are performed in the low-dimensional input spaceThe kernel function is

K(x xprime) = ϕ(x)Tϕ(xprime) (24)

which is equal to the inner product of the two vectors x and xprime in the feature space Usingkernels a new non-linear decision function is retrieved

y(x) = sign

lsumj=1

yjK(x xprime) + w0

(25)

which corresponds to the form of the hyperplane in the input space [2] [11]

24 Histogram of oriented gradients

Histogram of oriented gradients (HOG) is a commonly used feature extraction method formachine learning implementations for object detection It works by describing an imageas a set of local histograms which in turn represent occurrences of gradient orientations ina local part of the image The image is divided into blocks with 50 overlap each blockis in turn divided into cells Due to the overlap of the blocks one cell can be present in

8 2 Related theory

more than one block For each pixel in each cell the gradients in the x and y directions(Gx and Gy) are calculated The gradients represent the edges in an image in the twodirections and are illustrated in image 25

(a) Original image

(b) Gradient in the x direction Gx (c) Gradient in the y direction Gy

Figure 25 An image and its gradient representations in the x and y directions

The magnitude and phase of the gradients are then calculated according to

r =radicG2x + G2

y (26)

θ = arctan(GyGx

)(27)

For each cell a histogram of orientations is created The phases are used to vote intobins which are equally spaced between 0 minus 180 when using unsigned gradients Usingunsigned gradients means that whether an edge goes from dark to bright or from bright

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 × 5 block in an image on which the parameters γ and ζ are calculated

(b) A 5 × 5 block split into radial frequency sub-bands a on which Ra is calculated

(c) A 5 × 5 block split into oriented sub-bands b on which ζb is calculated

Figure 2.7 Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 0.3–10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block (a fitting sketch is shown after this list). The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ,

$$\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^2(2/\gamma)}} - 1 \qquad (2.15)$$

where X is a random variable representing the histogrammed DCT coefficients, σ|X| and μ|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, corresponding to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance,

$$E_a = \sigma_a^2 \qquad (2.16)$$

The average energy up to band a is described by

$$E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j \qquad (2.17)$$

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the $\sigma_a^2$ from the fit. Using the two parameters $E_a$ and $E_{j<a}$, a ratio $R_a$ between the difference and the sum of the components is computed according to

$$R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (2.18)$$

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios $R_2$ and $R_3$ is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of $R_a$ and the mean of $R_a$.

• The orientation model-based feature ζb, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions in order to capture directional information. ζb is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζb from all the blocks in an image is calculated. ζb and the variance of ζb are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the highest 10th percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.
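A sketch of how the shape parameter γ and the frequency variation coefficient ζ can be estimated for one block is given below. The grid resolution, the histogram binning and the squared-error matching criterion are assumptions; the thesis only states that γ is searched in the range 0.3–10 so that the model best matches the actual coefficient distribution.

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def fit_ggd_shape(coeffs, gammas=np.arange(0.3, 10.0, 0.05)):
    """Grid search for the generalized Gaussian shape parameter (eq. 2.11)
    that best matches the empirical distribution of DCT coefficients in a block."""
    x = coeffs.ravel()
    mu, sigma = x.mean(), x.std() + 1e-12
    hist, edges = np.histogram(x, bins=32, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_gamma, best_err = None, np.inf
    for g in gammas:
        beta = (1.0 / sigma) * np.sqrt(gamma_fn(3.0 / g) / gamma_fn(1.0 / g))  # eq. 2.13
        alpha = beta * g / (2.0 * gamma_fn(1.0 / g))                            # eq. 2.12
        model = alpha * np.exp(-(beta * np.abs(centers - mu)) ** g)             # eq. 2.11
        err = np.sum((model - hist) ** 2)
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma

def frequency_variation(coeffs):
    """Frequency variation coefficient zeta (eq. 2.15), here computed directly
    from the DCT coefficient magnitudes of the block."""
    mag = np.abs(coeffs.ravel())
    return mag.std() / (mag.mean() + 1e-12)
```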

The features are extracted and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is represented by fewer values, as if it was a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2, as sketched below.
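A sketch of the three-scale pyramid is shown below; the filter width is an assumption, since the thesis only specifies a symmetric Gaussian kernel and a sub-sampling factor of 2.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def three_scales(image, sigma=1.0):
    """Return the three scales on which the eight DCT-domain features are extracted:
    the original image plus two coarser versions, each obtained by Gaussian low-pass
    filtering followed by sub-sampling with a factor of 2."""
    scales = [image.astype(float)]
    for _ in range(2):
        blurred = gaussian_filter(scales[-1], sigma=sigma)
        scales.append(blurred[::2, ::2])
    return scales
```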

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

A convolutional neural network (CNN) is a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 2.9 The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, shown in figure 2.10.


Figure 2.10 The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs of the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image, the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer, the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers, the output is passed on to the fully connected layers. In the fully connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].
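A minimal sketch of the layer structure in figure 2.10 is given below in PyTorch. The channel sizes, kernel sizes and the 1 × 32 × 32 input resolution are assumptions chosen for illustration; they do not describe any network used in this work.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # filter bank layer
    nn.Sigmoid(),                     # non-linearity layer
    nn.MaxPool2d(2),                  # sub-sampling (pooling) layer
    nn.Conv2d(6, 16, kernel_size=5),
    nn.Sigmoid(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fully connected layers
    nn.Linear(120, 10),               # one output neuron per class
)
# soft-max values describing the probability of each class
scores = torch.softmax(model(torch.randn(1, 1, 32, 32)), dim=1)
```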

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, and to cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
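The sketch below illustrates the idea of reusing a pre-trained network as a feature extractor. The thesis uses MatConvNet's VGG-F; here a torchvision VGG-16 stands in, and the model choice, preprocessing values and file name are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(pretrained=True).eval()      # stand-in for the VGG-F network
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    activations = model(image).squeeze(0)          # 1000 activations from the final layer
    descriptor = torch.softmax(activations, dim=0) # 1000 soft-max values used as features
print(descriptor.shape)                            # torch.Size([1000])
```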


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

$$\langle (\alpha_1, \beta_1), \dots, (\alpha_n, \beta_n) \rangle, \quad j = 1, 2, \dots, n$$

where $\alpha_j$ is the number of coherent pixels, $\beta_j$ is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I′ is then given by the following parameters:

$$\text{differentiating pixels} = \sum_{j=1}^{n} \left( |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \right) \qquad (2.19)$$

$$\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (2.20)$$

[17]
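A sketch of the color coherence vector and the similarity measure is given below, using SciPy's connected-component labelling. The connectivity choice and the threshold value are assumptions for the example.

```python
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(indexed, n_colors=128, tau=2500):
    """Color coherence vector for an image of indexed colors (values 0..n_colors-1).
    A pixel is coherent if its contiguous region of the same color exceeds tau pixels."""
    alpha = np.zeros(n_colors, dtype=int)   # coherent pixels per color
    beta = np.zeros(n_colors, dtype=int)    # incoherent pixels per color
    for c in range(n_colors):
        regions, n = label(indexed == c)
        for r in range(1, n + 1):
            size = int(np.sum(regions == r))
            if size > tau:
                alpha[c] += size
            else:
                beta[c] += size
    return alpha, beta

def similarity(ccv1, ccv2, n_pixels):
    """Similarity between two images from their CCVs (equations 2.19 and 2.20)."""
    diff = np.sum(np.abs(ccv1[0] - ccv2[0]) + np.abs(ccv1[1] - ccv2[1]))
    return 1.0 - diff / (n_pixels * 2)
```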

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in Figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.




Figure 3.1 Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part, a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality


parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task in order to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being its robustness against over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
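A sketch of the predictor step is shown below. The thesis uses MATLAB's SVM implementation; scikit-learn's SVC plays the same role here, and the feature matrices, labels and kernel choice are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(100, 36)        # stand-in feature vectors, one row per image
y_train = np.random.randint(0, 2, 100)   # 0 = bad/non-salient, 1 = good/salient
X_eval = np.random.rand(10, 36)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)                # training part: create the classification model

labels = clf.predict(X_eval)             # class label per evaluation image
scores = clf.decision_function(X_eval)   # certainty score per evaluation image
```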

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
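The greedy clustering described above can be sketched as follows; the order in which images are assigned to clusters is an assumption, since the thesis does not specify it.

```python
import numpy as np

def cluster_by_similarity(sim, threshold=0.87):
    """Place each image in the first cluster to which its average similarity exceeds
    the threshold; otherwise start a new cluster. sim is a symmetric matrix of
    pairwise similarity measures in [0, 1]."""
    clusters = []
    for i in range(sim.shape[0]):
        for cluster in clusters:
            if np.mean([sim[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```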


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false negatives/positives are achieved.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (3.1)$$

which describes how many of the retrieved images are images that should be retrieved,

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3.2)$$

which describes how many of the images that should be retrieved are actually retrieved, and

$$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \qquad (3.3)$$

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy

Figure 3.2 An illustration of the concept used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
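For completeness, a direct implementation of equations 3.1-3.3 from the counts of true/false positives/negatives could look as follows.

```python
def precision_recall_accuracy(tp, fp, tn, fn):
    """Evaluation measures from equations 3.1-3.3."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy
```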

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise,


adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.


(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 3.3 An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.
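A sketch of the random degradation and the SSIM check is given below. The parameter ranges are assumptions and motion blur is omitted for brevity; scikit-image's SSIM implementation stands in for the measure introduced in [21].

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity
from skimage.util import random_noise

def degrade(image, rng=None):
    """Apply at most one alteration from the light group and one from the
    noise/blur group, each with randomized strength (motion blur omitted)."""
    rng = rng or np.random.default_rng()
    out = image.astype(float) / 255.0
    light = rng.choice(['brighten', 'darken', 'none'])
    if light == 'brighten':
        out = np.clip(out + rng.uniform(0.1, 0.5), 0, 1)
    elif light == 'darken':
        out = np.clip(out - rng.uniform(0.1, 0.5), 0, 1)
    noise = rng.choice(['salt_pepper', 'gaussian_noise', 'gaussian_blur', 'none'])
    if noise == 'salt_pepper':
        out = random_noise(out, mode='s&p', amount=rng.uniform(0.01, 0.2))
    elif noise == 'gaussian_noise':
        out = random_noise(out, mode='gaussian', var=rng.uniform(0.001, 0.05))
    elif noise == 'gaussian_blur':
        out = gaussian_filter(out, sigma=rng.uniform(0.5, 5.0))
    return out

original = np.random.rand(500, 500)                  # stand-in for a grayscale image
degraded = degrade((original * 255).astype(np.uint8))
# keep the degraded image in the good class only if SSIM stays above 0.65
is_good = structural_similarity(original, degraded, data_range=1.0) > 0.65
```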

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them having


an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4 Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1 The proportion of good images for the different salient objects

Proportion good images    Salient object
0.6951    cat
0.7288    airplane
0.6935    umbrella
0.6821    handbag
0.6902    motorbike



Table 4.2 Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient

Feature extraction method    Precision    Recall    Accuracy    Salient object
HOG    0.8399    0.939    0.8332    cat
HOG    0.8544    0.9799    0.8636    airplane
HOG    0.8018    0.9702    0.813    umbrella
HOG    0.8333    0.9442    0.8332    handbag
HOG    0.8506    0.9236    0.8353    motorbike
HOG    0.8360    0.9514    0.8357    average
Extracted from the DCT domain    0.9196    0.9116    0.8832    cat
Extracted from the DCT domain    0.9292    0.9500    0.9109    airplane
Extracted from the DCT domain    0.9348    0.9444    0.9158    umbrella
Extracted from the DCT domain    0.9348    0.9251    0.9049    handbag
Extracted from the DCT domain    0.9308    0.9425    0.9120    motorbike
Extracted from the DCT domain    0.9298    0.9347    0.9054    average
Features extracted from a CNN    0.6951    1    0.6951    cat
Features extracted from a CNN    0.7288    1    0.7288    airplane
Features extracted from a CNN    0.6935    1    0.6935    umbrella
Features extracted from a CNN    0.6821    1    0.6821    handbag
Features extracted from a CNN    0.6902    1    0.6902    motorbike
Features extracted from a CNN    0.6979    1    0.6979    average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.1 ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3 Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient

Feature extraction method    Precision    Recall    Accuracy    Salient object
HOG    0.6631    0.6717    0.6652    cat
HOG    0.8645    0.8043    0.8391    airplane
HOG    0.5959    0.5739    0.5924    umbrella
HOG    0.6759    0.6348    0.6652    handbag
HOG    0.5758    0.7348    0.5967    motorbike
HOG    0.6750    0.6839    0.6717    average
Extracted from the DCT domain    0.6253    0.6239    0.6250    cat
Extracted from the DCT domain    0.8182    0.6457    0.7511    airplane
Extracted from the DCT domain    0.6223    0.6196    0.6217    umbrella
Extracted from the DCT domain    0.6256    0.5630    0.613    handbag
Extracted from the DCT domain    0.5881    0.7326    0.6098    motorbike
Extracted from the DCT domain    0.6559    0.6370    0.6441    average
Features extracted from a CNN    0.9038    0.7761    0.8467    cat
Features extracted from a CNN    1    0.6935    0.8467    airplane
Features extracted from a CNN    0.8155    0.8457    0.8272    umbrella
Features extracted from a CNN    0.7560    0.6804    0.7304    handbag
Features extracted from a CNN    0.9242    0.8217    0.8772    motorbike
Features extracted from a CNN    0.8799    0.7635    0.8256    average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.2 ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.


(a) (b)

(c)

Figure 4.3 Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a

(b) Color coherence vector of image 4.3b

(c) Color coherence vector of image 4.3c

Figure 4.4 Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis shows the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5 Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the truly similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4 Results from the evaluation of the retrieval part for different categories as salient

Precision    Recall    Accuracy    Salient object
0.7782    0.9421    0.7806    cat
0.8071    0.8471    0.7611    airplane
0.7698    0.8843    0.7444    umbrella
0.7537    0.8471    0.7111    handbag
0.7935    0.9050    0.7778    motorbike
0.7805    0.8851    0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5 Results from the evaluation of the entire system for different categories as salient

Precision    Recall    Accuracy    Salient object
0.5944    0.6813    0.8543    cat
0.6890    0.5117    0.8663    airplane
0.5055    0.6696    0.8168    umbrella
0.4717    0.5117    0.8027    handbag
0.6169    0.6404    0.8592    motorbike
0.5755    0.6029    0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with quite a high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications



in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which is basically two different types of errors: letting too many images pass as good or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since it uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since it is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context and the color coherence vectors


are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image being both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be


found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies with the sets having different salient objects, it is likely that it varies with data sets as well. The data set differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Ap-plications Computer science technology and applications Nova Science Publish-ers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield K Simonyan A Vedaldi and A Zisserman Return of the devil in thedetails Delving deep into convolutional nets In British Machine Vision Conference2014 Cited on pages 15 and 18

[4] Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi 105591978-1-57735-516-8IJCAI11-210. URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210. Cited on page 13.

[5] RL Delanoy Machine learning apparatus and method for image searching Au-gust 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzengand Trevor Darrell Decaf A deep convolutional activation feature for generic visualrecognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognitionbased histogram of oriented gradient and normalized cross correlation In 2016 In-ternational Conference on Information Technology for Organizations Development



(IT4OD), pages 1–5, March 2016. doi 101109IT4OD20167479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J Belongie, Lubomir D Bourdev, Ross B Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL httparxivorgabs14050312. Cited on page 3.

[11] MathWorks Support vector machines for binary classifica-tion URL httpssemathworkscomhelpstatssupport-vector-machines-for-binary-classificationhtmlCited on pages 6 7 and 19

[12] MathWorks Extracthogfeatures URL httpssemathworkscomhelpvisionrefextracthogfeatureshtml Cited on page 9

[13] MathWorks Discrete cosine transform URL httpssemathworkscomhelpimagesdiscrete-cosine-transformhtml Cited onpage 10

[14] MathWorks Supervised learning workflow and algorithms URL httpssemathworkscomhelpstatssupervised-learning-machine-learning-workflow-and-algorithmshtmls_tid=conf_addres_DA_eb Cited on page 5

[15] Michael A Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er Harish Kundra Comparison of various image classificationmethods International Journal of Advances in Science and Technology (IJAST) 2(1) 2014 Cited on page 19

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi 101145244130244148. URL httpdoiacmorg101145244130244148. Cited on pages 16 and 19.

[18] Srini Penchikala Big data processing with apache spark - part 4 Spark ma-chine learning May 2016 URL httpswwwinfoqcomarticlesapache-spark-machine-learning Cited on page 4

[19] MA Saad AC Bovik and C Charrier Blind image quality assessment A naturalscene statistics approach in the dct domain IEEE Transactions on image processing21(8) August 2008 Cited on pages 10 11 and 19

Bibliography 45

[20] F Suard A Rakotomamonjy and A Bensrhair Pedestrian detection using infraredimages and histograms of oriented gradients In in IEEE Conference on IntelligentVehicles pages 206ndash212 2006 Cited on pages 9 18 and 19

[21] Zhou Wang A C Bovik H R Sheikh and E P Simoncelli Image quality as-sessment From error visibility to structural similarity Trans Img Proc 13(4)600ndash612 April 2004 ISSN 1057-7149 doi 101109TIP2003819861 URLhttpdxdoiorg101109TIP2003819861 Cited on pages 18and 22


more than one block. For each pixel in each cell, the gradients in the x and y directions (Gx and Gy) are calculated. The gradients represent the edges in an image in the two directions and are illustrated in figure 2.5.

Figure 2.5: An image and its gradient representations in the x and y directions. (a) Original image; (b) gradient in the x direction, Gx; (c) gradient in the y direction, Gy.

The magnitude and phase of the gradients are then calculated according to

r = \sqrt{G_x^2 + G_y^2}    (2.6)

\theta = \arctan\left(\frac{G_y}{G_x}\right)    (2.7)

For each cell a histogram of orientations is created. The phases are used to vote into bins which are equally spaced between 0° and 180° when using unsigned gradients. Using unsigned gradients means that whether an edge goes from dark to bright or from bright to dark does not matter. To achieve that, angles below 0° are increased by 180° and angles above 180° are decreased by 180°. The vote from each angle is weighted by the corresponding magnitude of the gradient. The histograms are then normalized with respect to the cells in the same block. Finally, the histograms for all cells are concatenated into a vector, which is the resulting feature vector [20] [8]. The resulting histograms for all cells in an image are shown as rose plots in figure 2.6.

Figure 2.6: The histogram of each cell in the image is visualized using rose plots. (a) Image with rose plots; (b) zoomed in. The rose plots show the edge directions, which are normal to the gradient directions used in the histograms. Each bin is represented by a petal of the rose plot. The length of the petal indicates the size of that bin, meaning the contribution to that direction. The histograms have bins between 0° and 180°, which makes the rose plots symmetric [12].
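To make the procedure concrete, the following is a minimal MATLAB sketch of HOG feature extraction using the Computer Vision System Toolbox function extractHOGFeatures [12]. The file name and the cell size are illustrative assumptions; the thesis does not state the exact settings used.

img  = imread('example.jpg');                        % hypothetical input image
gray = rgb2gray(img);                                % gradients are computed on intensity values
[hogVector, visualization] = extractHOGFeatures(gray, 'CellSize', [8 8]);
% hogVector is the concatenated, block-normalized histogram vector used as the
% feature descriptor; the visualization can be drawn as rose plots over the image:
imshow(gray); hold on;
plot(visualization);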

2.5 Features extracted from the discrete cosine transform domain

Representing an image or an image patch I of size M × N in the discrete cosine domain is done by transforming the image pixel values according to

B_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I_{mn} \cos\left(\frac{\pi(2m+1)p}{2M}\right) \cos\left(\frac{\pi(2n+1)q}{2N}\right)    (2.8)

where 0 \le p \le M-1, 0 \le q \le N-1,

\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases}    (2.9)

and

\alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases}    (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
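As an illustration, the block-wise transform can be realized with the Image Processing Toolbox function dct2, which implements equation (2.8). The sketch below assumes 5 × 5 blocks with a two-pixel overlap, as described in the following paragraphs; the file name is an assumption.

I = double(rgb2gray(imread('example.jpg')));   % hypothetical input image
N = 5;                                         % block size
step = N - 2;                                  % two-pixel overlap between neighbouring blocks
dctBlocks = {};
for r = 1:step:size(I,1) - N + 1
    for c = 1:step:size(I,2) - N + 1
        block = I(r:r+N-1, c:c+N-1);
        dctBlocks{end+1} = dct2(block);        % 2D local DCT coefficients of the block
    end
end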

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features extracted from the discrete cosine transform (DCT) domain are defined by [19]; they represent image structure and their statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two-pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of the DCT image coefficients:

f(x \mid \alpha, \beta, \gamma) = \alpha \exp\left(-(\beta |x - \mu|)^{\gamma}\right)    (2.11)

where x is the multivariate random variable, μ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

\alpha = \frac{\beta \gamma}{2\,\Gamma(1/\gamma)}    (2.12)

\beta = \frac{1}{\sigma} \sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}}    (2.13)

where σ is the standard deviation and Γ is the gamma function given by

\Gamma(z) = \int_{0}^{\infty} t^{z-1} \exp(-t) \, dt    (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.

Figure 2.7: Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks. (a) A 5 × 5 block in an image, on which the parameters γ and ζ are calculated; (b) a 5 × 5 block split into radial frequency sub-bands a, on which Ra is calculated; (c) a 5 × 5 block split into oriented sub-bands b, on which ζb is calculated. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8: Generalized Gaussian distribution for different values of γ.

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ,

\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)} - 1}    (2.15)

where X is a random variable representing the histogrammed DCT coefficients, σ|X| and μ|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14), and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = \sigma_a^2    (2.16)

The average energy up to band a is described by

E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j    (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the σ²_a from the fit. Using the two parameters E_a and E_{j<a}, a ratio R_a between the difference of the components and the sum of the components is formed according to

R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}}    (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios R1 and R2 is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of R_a and the mean of R_a.

• The orientation model-based feature ζ_b, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζ_b is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζ_b from all the blocks in an image is calculated. ζ_b and the variance of ζ_b are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the 10th highest percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is summarized by fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
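One possible way to realize the search for γ over the range 0.3-10 is moment matching against the ratio in equation (2.15): the theoretical ratio is computed for a grid of candidate γ values and compared with the ratio measured from the DCT coefficients of a block. The sketch below is an assumption about the fitting procedure, since the thesis does not specify it exactly.

function gammaHat = estimate_gamma(x)
    % x: DCT coefficients of one block (any shape)
    x = x(:) - mean(x(:));
    rhoEmp = var(x) / mean(abs(x))^2;                 % empirical sigma^2 / mu_|X|^2
    gammas = 0.3:0.001:10;                            % candidate shape parameters
    rhoTheory = gamma(1 ./ gammas) .* gamma(3 ./ gammas) ./ gamma(2 ./ gammas).^2;
    [~, idx] = min(abs(rhoTheory - rhoEmp));          % best matching generalized Gaussian
    gammaHat = gammas(idx);
end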

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

A convolutional neural network (CNN) is a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.

Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data, and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which are applied to images and have a special layer structure, which is shown in figure 2.10.

Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image, the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer, the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels, which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers, the output is passed on to the fully connected layers. In the fully connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results which outperform many existing methods, and to cluster with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images, used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
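A sketch of how such descriptors could be extracted with MatConvNet is shown below. It assumes that MatConvNet is installed and that the pre-trained imagenet-vgg-f model file has been downloaded; field names such as net.meta.normalization and the exact output index differ between MatConvNet releases, so the code is an illustration rather than the exact implementation used.

net = load('imagenet-vgg-f.mat');                            % pre-trained VGG-F model
im  = single(imread('example.jpg'));                         % hypothetical input image
im  = imresize(im, net.meta.normalization.imageSize(1:2));   % resize to the network input size
im  = im - net.meta.normalization.averageImage;              % subtract the training mean image
res = vl_simplenn(net, im);                                  % forward pass through all layers
descriptor = squeeze(res(end).x);                            % 1000 soft-max values used as features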

2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

< (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) >,   j = 1, 2, \ldots, n

where α_j is the number of coherent pixels and β_j is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I′ is then given by the following parameters:

\text{differentiating pixels} = \sum_{j=1}^{n} |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j|    (2.19)

\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2}    (2.20)

[17]
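The following MATLAB sketch shows one way the color coherence vectors and the similarity measure in equations (2.19) and (2.20) could be computed, using 128 indexed colors and a coherence threshold of 2500 pixels (the values used later in the method chapter). The interpretation of "all pixels" as the number of pixels in one image, and the function and variable names, are assumptions.

function s = ccv_similarity(im1, im2)
    [a1, b1] = ccv(im1);
    [a2, b2] = ccv(im2);
    diffPixels = sum(abs(a1 - a2) + abs(b1 - b2));            % equation (2.19)
    s = 1 - diffPixels / (size(im1,1) * size(im1,2) * 2);     % equation (2.20)
end

function [alpha, beta] = ccv(im)
    nColors = 128;                                   % number of indexed colors
    tau     = 2500;                                  % coherence threshold in pixels
    im  = imfilter(im, fspecial('average', [5 5]));  % local averaging, 5 x 5 pixels
    idx = rgb2ind(im, jet(nColors));                 % index the colors with the jet colormap
    alpha = zeros(1, nColors);
    beta  = zeros(1, nColors);
    for c = 1:nColors
        cc    = bwconncomp(idx == c - 1);            % contiguous regions of color c
        sizes = cellfun(@numel, cc.PixelIdxList);
        alpha(c) = sum(sizes(sizes >  tau));         % coherent pixels
        beta(c)  = sum(sizes(sizes <= tau));         % incoherent pixels
    end
end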

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flow chart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.


Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part, a selection will be made from sets of images that are similar, so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part, and the entire system is put together. The methods used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
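As a minimal sketch, the training and classification with the MATLAB implementation could look as follows, assuming the training feature vectors are stored row-wise in Xtrain with labels in Ytrain and the evaluation features in Xeval; the variable names are illustrative.

qualityModel = fitcsvm(Xtrain, Ytrain);            % train a binary SVM classification model
[labels, scores] = predict(qualityModel, Xeval);   % classify new data
% labels holds the predicted class of each evaluation image and scores holds the
% certainty scores for the negative and the positive class, respectively.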

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measurement is done based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in the number of coherent pixels and the number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
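A minimal sketch of this clustering rule is given below. It assumes that S is a full symmetric matrix of pairwise similarities in [0, 1] (the upper triangular matrix mirrored) and that images are considered in index order; the greedy assignment order is an assumption.

function clusters = cluster_by_similarity(S)
    threshold = 0.87;                                % average similarity required to join a cluster
    clusters  = {};                                  % each cell holds the image indices of one cluster
    for i = 1:size(S, 1)
        placed = false;
        for k = 1:numel(clusters)
            if mean(S(i, clusters{k})) > threshold   % average similarity to all images in the cluster
                clusters{k}(end+1) = i;
                placed = true;
                break;
            end
        end
        if ~placed
            clusters{end+1} = i;                     % start a new cluster with this image
        end
    end
end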

3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way, true/false negatives/positives are achieved.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}    (3.1)

which describes how many of the retrieved images should have been retrieved,

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}    (3.2)

which describes how many of the images that should be retrieved are actually retrieved, and

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}}    (3.3)

which describes how many classifications are correct out of all the classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.

Figure 3.2: An illustration of the concept used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d). Sub-figures: (a) parts of a quantity of images; (b) precision; (c) recall; (d) accuracy.
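Computed from logical vectors of predicted and wanted retrievals, the measures follow directly from equations (3.1)-(3.3); a minimal sketch with assumed variable names:

% predicted and wanted are logical vectors: true = retrieved / should be retrieved
tp = sum( predicted &  wanted);          % true positives
fp = sum( predicted & ~wanted);          % false positives
fn = sum(~predicted &  wanted);          % false negatives
tn = sum(~predicted & ~wanted);          % true negatives
precision = tp / (tp + fp);              % equation (3.1)
recall    = tp / (tp + fn);              % equation (3.2)
accuracy  = (tp + tn) / numel(wanted);   % equation (3.3)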

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding, it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65. (a) Original image; (b) brightened and Gaussian blurred; (c) motion blurred; (d) darkened and added salt and pepper noise.
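A minimal sketch of one such degradation and the SSIM check is shown below. The particular alterations and parameter values are illustrative examples; in the actual data generation the alterations and their degrees are drawn at random as described above.

original = imread('example.jpg');                      % hypothetical input image
degraded = imnoise(original, 'salt & pepper', 0.05);   % noise/blur group: salt and pepper noise
degraded = degraded + 50;                              % light group: brightening (uint8 saturates)
% Gaussian blur or motion blur could instead be applied with
% imgaussfilt(...) or imfilter(..., fspecial('motion', len, theta)).
if ssim(rgb2gray(degraded), rgb2gray(original)) > 0.65
    label = 'good';                                    % more than 65% of the structures preserved
else
    label = 'bad';
end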

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set as similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which in turn is equal to the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike

Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.8399      0.939    0.8332     cat
HOG                              0.8544      0.9799   0.8636     airplane
HOG                              0.8018      0.9702   0.813      umbrella
HOG                              0.8333      0.9442   0.8332     handbag
HOG                              0.8506      0.9236   0.8353     motorbike
HOG                              0.8360      0.9514   0.8357     average
Extracted from the DCT domain    0.9196      0.9116   0.8832     cat
Extracted from the DCT domain    0.9292      0.9500   0.9109     airplane
Extracted from the DCT domain    0.9348      0.9444   0.9158     umbrella
Extracted from the DCT domain    0.9348      0.9251   0.9049     handbag
Extracted from the DCT domain    0.9308      0.9425   0.9120     motorbike
Extracted from the DCT domain    0.9298      0.9347   0.9054     average
Features extracted from a CNN    0.6951      1        0.6951     cat
Features extracted from a CNN    0.7288      1        0.7288     airplane
Features extracted from a CNN    0.6935      1        0.6935     umbrella
Features extracted from a CNN    0.6821      1        0.6821     handbag
Features extracted from a CNN    0.6902      1        0.6902     motorbike
Features extracted from a CNN    0.6979      1        0.6979     average

Figure 4.1: ROC curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.

4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50%, and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.6631      0.6717   0.6652     cat
HOG                              0.8645      0.8043   0.8391     airplane
HOG                              0.5959      0.5739   0.5924     umbrella
HOG                              0.6759      0.6348   0.6652     handbag
HOG                              0.5758      0.7348   0.5967     motorbike
HOG                              0.6750      0.6839   0.6717     average
Extracted from the DCT domain    0.6253      0.6239   0.6250     cat
Extracted from the DCT domain    0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain    0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain    0.6256      0.5630   0.613      handbag
Extracted from the DCT domain    0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain    0.6559      0.6370   0.6441     average
Features extracted from a CNN    0.9038      0.7761   0.8467     cat
Features extracted from a CNN    1           0.6935   0.8467     airplane
Features extracted from a CNN    0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN    0.7560      0.6804   0.7304     handbag
Features extracted from a CNN    0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN    0.8799      0.7635   0.8256     average

Figure 4.2: ROC curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.

4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown, in figure 4.5b, is a binary similarity matrix showing the true clusters as yellow. The results from the retrieval part are shown in table 4.4.

Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).

Figure 4.4: Color coherence vectors of the images in figure 4.3: (a) color coherence vector of image 4.3a; (b) color coherence vector of image 4.3b; (c) color coherence vector of image 4.3c. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the truly similar images as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.

Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.7782      0.9421   0.7806     cat
0.8071      0.8471   0.7611     airplane
0.7698      0.8843   0.7444     umbrella
0.7537      0.8471   0.7111     handbag
0.7935      0.9050   0.7778     motorbike
0.7805      0.8851   0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. It is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. It is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.5944      0.6813   0.8543     cat
0.6890      0.5117   0.8663     airplane
0.5055      0.6696   0.8168     umbrella
0.4717      0.5117   0.8027     handbag
0.6169      0.6404   0.8592     motorbike
0.5755      0.6029   0.8399     average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with quite a high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore, it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b. Ideal results are shown in a ROC curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC curve for HOG features in figure 4.1a, the curves are not very similar, which is partly because of the different proportions of good images, but mostly because it does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which basically corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.

5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar, except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since it uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist of mostly context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part, and all of the similarity retrieval part, evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are correct. In many cases, an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.

5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such, it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context typically prevails during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context; a sketch of such a weighting is given below. The images might also be processed by an automatic target recognition system during flights when collecting data, although such results are not available for this study. Taking advantage of the output from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found need to be investigated.
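The proposed down-weighting can be sketched as follows, assuming color coherence vectors of the form described in section 2.7 (one pair of coherent/incoherent pixel counts per indexed color) and the similarity measure used in the thesis. The weight values and the choice of which indexed colors count as context colors are hypothetical and would have to come from knowledge about the planned flight.

import numpy as np

def weighted_ccv_similarity(alpha1, beta1, alpha2, beta2, weights, n_pixels):
    # alpha*/beta*: coherent/incoherent pixel counts per indexed color (length 128).
    # weights: per-color weights, below 1 for colors expected to dominate the
    #          flight context (e.g. sky or water), so that context differences
    #          contribute less to the measure.
    # n_pixels: pixels per image; both images are assumed to be the same size.
    differing = np.sum(weights * (np.abs(alpha1 - alpha2) + np.abs(beta1 - beta2)))
    return 1.0 - differing / (2.0 * n_pixels)

# Hypothetical example: 128 indexed colors and 500 x 500 images as in the thesis,
# with the last 28 color bins assumed to be typical background colors.
weights = np.ones(128)
weights[100:] = 0.2

rng = np.random.default_rng(0)
p = np.ones(128) / 128
a1, b1 = rng.multinomial(200_000, p), rng.multinomial(50_000, p)
a2, b2 = rng.multinomial(200_000, p), rng.multinomial(50_000, p)
print(weighted_ccv_similarity(a1, b1, a2, b2, weights, n_pixels=500 * 500))

With all weights equal to one, the function reduces to the unweighted similarity measure used in the thesis.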

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. This suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
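As an illustration of that pipeline, the sketch below extracts fixed descriptors from a pre-trained network and trains an SVM on them. The thesis used MatConvNet's VGG-F model and MATLAB's SVM implementation; the torchvision and scikit-learn calls below are stand-ins chosen for illustration, with VGG-16 in place of VGG-F, and the variable names are hypothetical.

import torch
from torchvision import models, transforms
from sklearn.svm import SVC

# Pre-trained backbone used as a fixed feature extractor; the descriptors are
# taken from the last fully connected layer before the class scores.
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone.classifier = backbone.classifier[:-1]   # drop the final classification layer
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(pil_images):
    # Returns one descriptor vector per input PIL image.
    batch = torch.stack([preprocess(img) for img in pil_images])
    with torch.no_grad():
        return backbone(batch).numpy()

# train_images: list of PIL images, train_labels: 1 for salient, 0 for non-salient
# classifier = SVC(kernel="linear").fit(extract_features(train_images), train_labels)

Fine-tuning such a network on the intended flight data, rather than only reusing its fixed descriptors, is the natural next step suggested above.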

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is likely that it varies between data sets as well. The data set used differs considerably from the data sets for which the system is intended. A data set containing automatically recorded flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended kind of data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is located.

Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 16: Feature extraction for image selection using machine learning

25 Features extracted from the discrete cosine transform domain 9

to dark does not matter To achieve that angles below 0 are increased by 180 andangles above 180 are decreased by 180 The vote from each angle is weighted bythe corresponding magnitude of the gradient The histograms are then normalized withrespect to the cells in the same block Finally the histograms for all cells are concatenatedinto a vector which is the resulting feature vector [20] [8] The resulting histograms forall cells in an image is shown as rose plots in figure 26

(a) Image with rose plots (b) Zoomed in

Figure 26 The histograms of each cell in the image is visualized using rose plotsThe rose plots shows the edge directions which are normal to the gradient directionsused in the histograms Each bin is represented by a petal of the rose plot The lengthof the petal indicates the size of that bin meaning the contribution to that directionThe histograms have bins between 0 minus180 which makes the rose plots symmetric[12]

25 Features extracted from the discrete cosinetransform domain

Representing an image or an image patch I of size M times N in the discrete cosine domainis done by transforming the image pixel values according to

Bpq = αpαqMminus1summ=0

Nminus1sumn=0

Imn cos(π(2m + 1)p

2M

)cos

(π(2n + 1)q

2N

)(28)

where 0 le p le M minus 1 0 le q le N minus 1

αp =

1radicM p = 0radic

2M 1 le p le M minus 1(29)

and

10 2 Related theory

αq =

1radicN p = 0radic

2N 1 le p le N minus 1(210)

As seen in equation (28) the image is represented as a sum of sinusoids with varyingfrequencies and magnitudes after the transform The benefit of representing an imagein the DCT domain is that most of the visually significant information in the image isconcentrated in just a few coefficients which represent frequencies instead of pixel values[13]

It has been shown that natural undistorted images exhibit strong structural dependen-cies These dependencies are local spatial frequencies that interfere constructively anddestructively over scales to produce the spatial structure in natural scenes Features thatare extracted from the discrete cosine transform (DCT) domain are defined by [19] whichrepresent image structure and whose statistics are observed to change with image distor-tions The structural information in natural images can loosely be described as smooth-ness texture and edge information

The features are extracted from an image by splitting the image into equally sizedN times N blocks with two pixel overlap between neighbouring blocks For each block2D local DCT coefficients are calculated using the discrete cosine transform described inequation (28) Then a generalized Gaussian density model shown in equation (211) isintroduced and used to approximate the distribution of DCT image coefficients

f (x|α β γ) = α exp (minus(β|x minus micro|)γ ) (211)

where x is the multivariate random variable micro is the mean γ is the shape parameter αand β are the normalizing and scale parameters given by

α =βγ

2Γ (1γ)(212)

β =1σ

radicΓ (3γ)Γ (1γ)

(213)

where σ is the standard deviation and Γ is the gamma function given by

Γ (z) =

infinint0

tzminus1 exp(minust) dt (214)

The generalized Gaussian density model is applied to each block of DCT componentsand to special partitions within each block An example of a 5 times 5 sized block and itspartitions are illustrated in figure 32a One of these partitions emerge when each blockis partitioned into three radial frequency sub-bands which are represented as differentlevels of shadings in figure 27b The other partition emerge when each block is splitdirectionally into three oriented sub-regions which are represented as different levels ofshadings in figure 27c

25 Features extracted from the discrete cosine transform domain 11

(a) A 5 times 5 block inan image on which theparameters γ and ζ arecalculated

(b) A 5 times 5 block splitinto radial frequencysub-bands a on whichRa is calculated

(c) A 5times block split intooriented sub-bands b onwhich ζb is calculated

Figure 27 Illustrations of the dct components in a block which an image is splitinto and the partitions created in each of the blocks (Image source [19])

Then four parameters derived from the generalized Gaussian model parameters arecomputed These four parameters make up the features used for each image The retrievedvalues of each parameter is pooled in two different ways resulting in two features perparameters The parameters are as follows

bull The generalized Gaussian model shape parameter γ seen in equation (211) whichis a model-based feature that is retrieved over all blocks in the image The parameterγ determines the shape of the Gaussian distribution hence how the frequencies aredistributed in the blocks Figure 28 illustrates the generalized Gaussian distributionin equation (211) for different values of the parameter γ

Figure 28 Generalized Gaussian distribution for different values of γ

The parameter γ is retrieved by inserting values in the range 03-10 in equation

12 2 Related theory

(211) to find the distribution which best matches the actual distribution of DCTcomponents in each block The resulting features are the lowest 10th percentile ofγ and the mean of γ

bull The frequency variation coefficient ζ

ζ =σ|X |micro|X |

=

radicΓ (1γ)Γ (3γ)

Γ 2(2γ)minus 1 (215)

where X is a random variable representing the histogrammed DCT coefficients σ|X |and micro|X | are the standard deviation and mean of the DCT coefficient magnitudes ofthe fit to the generalized Gaussian model Γ is the gamma function given by equa-tion (214) and γ is the shape parameter The feature ζ is computed for all blocksin the image The ratio ζ has shown to correlate well with subjective judgement ofperceptual quality The resulting features are the highest 10th percentile of ζ andthe mean of ζ

bull The energy sub-band ratio which is retrieved from the partitions emerging fromsplitting each block into radial frequency sub bands The three sub bands are repre-sented by a where a = 1 2 3 which correspond to lower middle and higher spatialradial frequencies respectively The average energy in sub band a is defined as itsvariance described by

Ea = σ2a (216)

The average energy up to band n is described by

Ejlta =1

n minus 1

sumjlta

Ej (217)

The energy values are retrieved by fitting the DCT histogram in each band a to thegeneralized Gaussian model and then taking the σ2

a from the fit Using the twoparameters Ea and Ejlta a ratio Ra between the components and the sum of thecomponents according to

Ra =|Ea minus Ejlta|Ea + Ejlta

(218)

This ratio represents the relative distribution of energies in lower and higher bandswhich can be affected by distortions A large ratio value is retrieved when there isa large disparity between the frequency energy of a band and the average energy inthe bands of lower frequencies Since band a = 1 does not have any bands of lowerfrequency the ratio is calculated for a = 2 3 and the mean of the two resultingratios R1 and R2 is the feature used The feature is computed for all blocks in theimage The resulting features are the highest 10th percentile of Ra and the mean ofRa

bull The orientation model-based feature ζ which is retrieved from the partitions emerg-ing from splitting each block into oriented sub-regions to capture directional infor-mation ζb is defined according to equation (215) from the model histogram fits

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color

43 Similarity retrieval 33

(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices

34 4 Results

Table 44 Results from the evaluation of the retrieval part for different categories assalient

Precision Recall Accuracy Salient object07782 09421 07806 cat08071 08471 07611 airplane07698 08843 07444 umbrella07537 08471 07111 handbag07935 09050 07778 motorbike07805 08851 07550 average

44 The entire system

The entire system is put together using the quality classification models retrieved usingfeatures extracted from the DCT domain It is the feature extraction method which pro-vided the best results when investigating the quality classification in section 41 Themodels used for the content classifications are the ones retrieved using features extractedfrom a CNN It is the feature extraction method which provided the best results wheninvestigating the content classification in section 42 The evaluation of the entire systemis done for each of the salient objects The evaluation is performed on the same sets as theevaluation of the quality classification which contains the evaluation sets from the contentclassification and the similarity retrieval The output from the quality classification is in-put to the content classification and the output from the content classification is input tothe similarity retrieval part The results from the similarity retrieval part are the imagesthat are evaluated compared to the images which are wanted The images that are wantedare the ones which are actually good salient unique and best from its cluster There arefewer images that are wanted than images that are not since half of the images are salientand some of them are almost duplicates andor bad There are 342 wanted images out ofthe total 1840 images which makes the proportion of wanted images 01859 The resultsof how the entire system works together is seen in table 45

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision Recall Accuracy Salient object05944 06813 08543 cat06890 05117 08663 airplane05055 06696 08168 umbrella04717 05117 08027 handbag06169 06404 08592 motorbike05755 06029 08399 average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salientobjects it is at most 309 percentage units from the mean value The variation in accuracyvalues for the different sets of salient objects overall matches the variation in proportionin good images meaning that the salient objects with slightly higher proportion of goodimages also have slightly higher accuracy Therefore it is possible to interpret the resultsfrom the quality classification as being general and not varying remarkable with the dif-ferent salient objects This can be seen in the ROC-curves in figure 41b and 41c as thedifferent colored curves being similar the difference in proportion of good between thedifferent salient objects however causes slight variations In the ROC-curve for HOG fea-tures in figure 41a the curves are not very similar which is partly because the differentproportions of good images but mostly because it does not provide a good quality classi-fication model HOG provides a poor classification model from which the results variesbetween the different salient objects

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41

51 Results 37

512 Content classification

The evaluation of the content classification shows that features extracted from a CNN givesthe best results Features extracted from a CNN gives an average accuracy of 8256 com-pared to 6717 for HOG and 6441 for features extracted from the DCT domain Theaccuracy values have variances 3155 for features extracted from a CNN 10005 forHOG and 6571 for features extracted from the DCT domain Those numbers are allquite high and implies that the content classification is not general and varies significantlywith the different salient objects That can also be seen in the ROC-curves in figure 42as the different colored curves representing different salient objects are differing Figure42b which shows the results from using features extracted from the DCT domain showsthat the curves for the different salient objects are quite similar except for the categoryairplane All curves are rather close to the line where the true positive rate equals thefalse positive rate except for airplane Being close to that line for this case where each ofthe two classes contain half of the images corresponds to simply classifying all imagesin the same class That means that the category airplane is the only one for which a de-cent classification model is retrieved The bad performance of features extracted from theDCT domain for content classification for the majority of the different salient objects isnot astonishing since it uses very few features describing statistics in images associatedwith quality The decent result for the category airplane however is more astonishingsince it is able to differ somewhat between salient and non-salient images only describedby smoothness texture and edge information Features extracted from a CNN are trainedon a large set of images for an object classification task The task is similar to this con-tent classification and the features seem to fulfill their purpose of performing well whenapplied to new data sets HOG are often used for content classification tasks and perform-ing well However this shallow feature extraction method is outperformed by featuresextracted from a deep architecture

The number of salient and non-salient training images is approximately 2000 for eachsalient object but it varies slightly The largest salient class consists of 2418 images andthe smallest 1700 Although the number of training observations for each salient objectis quite large the variation may impact the capacity of the resulting content classificationmodels The variations in the content classification results is however more likely causedby the different content in the images

As described for the quality classification in section 511 if one type of error is pre-ferred over the other In this case where one class is retained and the other is not it mightbe more important not to discard too many salient images than to discard all non-salientimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 42

513 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 7550 with the best result being7806 and the worst 7111 The result varies with a few percentage points betweenthe different salient objects and the variance in accuracy is 813 That is most likelycaused by the context of the salient objects rather than the objects themselves That isbecause majority of the images consists of mostly context and the color coherence vectors


are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause as big a change in the color coherence vector as it would if the background were varying. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are probable causes for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall. This explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than the accuracy values for parts of the content classification and for all of the similarity retrieval evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that there is one more source of error in the similarity retrieval part when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified are counted as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough with SSIM to give an image with, for example, SSIM = 0.99 a lower quality score than an image with SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such methods it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky seen from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more can the approach be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be


found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is likely that it varies between data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.



\alpha_q = \begin{cases} \dfrac{1}{\sqrt{N}} & q = 0 \\[2mm] \sqrt{\dfrac{2}{N}} & 1 \le q \le N-1 \end{cases} \qquad (2.10)

As seen in equation (2.8), the image is represented as a sum of sinusoids with varying frequencies and magnitudes after the transform. The benefit of representing an image in the DCT domain is that most of the visually significant information in the image is concentrated in just a few coefficients, which represent frequencies instead of pixel values [13].
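This energy compaction can be illustrated with a small MATLAB sketch; the file name and the number of retained coefficients are arbitrary assumptions.

    % Keep only a small block of low-frequency DCT coefficients and reconstruct
    im   = im2double(rgb2gray(imread('example.jpg')));   % 'example.jpg' is a placeholder
    D    = dct2(im);
    mask = zeros(size(D));
    mask(1:50, 1:50) = 1;                                 % low-frequency corner only
    imRec = idct2(D .* mask);                             % visually close to the original
    fprintf('Energy kept: %.1f %%\n', 100*sum(D(mask==1).^2)/sum(D(:).^2));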

It has been shown that natural undistorted images exhibit strong structural dependencies. These dependencies are local spatial frequencies that interfere constructively and destructively over scales to produce the spatial structure in natural scenes. Features that are extracted from the discrete cosine transform (DCT) domain are defined by [19], which represent image structure and whose statistics are observed to change with image distortions. The structural information in natural images can loosely be described as smoothness, texture and edge information.

The features are extracted from an image by splitting the image into equally sized N × N blocks with two pixel overlap between neighbouring blocks. For each block, 2D local DCT coefficients are calculated using the discrete cosine transform described in equation (2.8). Then a generalized Gaussian density model, shown in equation (2.11), is introduced and used to approximate the distribution of DCT image coefficients.

f(x\,|\,\alpha, \beta, \gamma) = \alpha \exp\left(-\left(\beta\,|x - \mu|\right)^{\gamma}\right) \qquad (2.11)

where x is the multivariate random variable, µ is the mean, γ is the shape parameter, and α and β are the normalizing and scale parameters given by

\alpha = \frac{\beta\gamma}{2\,\Gamma(1/\gamma)} \qquad (2.12)

\beta = \frac{1}{\sigma}\sqrt{\frac{\Gamma(3/\gamma)}{\Gamma(1/\gamma)}} \qquad (2.13)

where σ is the standard deviation and Γ is the gamma function given by

\Gamma(z) = \int_{0}^{\infty} t^{z-1} \exp(-t)\, dt \qquad (2.14)

The generalized Gaussian density model is applied to each block of DCT components and to special partitions within each block. An example of a 5 × 5 sized block and its partitions is illustrated in figure 2.7a. One of these partitions emerges when each block is partitioned into three radial frequency sub-bands, which are represented as different levels of shading in figure 2.7b. The other partition emerges when each block is split directionally into three oriented sub-regions, which are represented as different levels of shading in figure 2.7c.


(a) A 5 × 5 block in an image, on which the parameters γ and ζ are calculated. (b) A 5 × 5 block split into radial frequency sub-bands a, on which Ra is calculated. (c) A 5 × 5 block split into oriented sub-bands b, on which ζb is calculated.

Figure 2.7: Illustrations of the DCT components in a block which an image is split into, and the partitions created in each of the blocks. (Image source: [19])

Then four parameters derived from the generalized Gaussian model parameters are computed. These four parameters make up the features used for each image. The retrieved values of each parameter are pooled in two different ways, resulting in two features per parameter. The parameters are as follows:

• The generalized Gaussian model shape parameter γ, seen in equation (2.11), which is a model-based feature that is retrieved over all blocks in the image. The parameter γ determines the shape of the Gaussian distribution, hence how the frequencies are distributed in the blocks. Figure 2.8 illustrates the generalized Gaussian distribution in equation (2.11) for different values of the parameter γ.

Figure 2.8: Generalized Gaussian distribution for different values of γ.

The parameter γ is retrieved by inserting values in the range 0.3-10 in equation (2.11) to find the distribution which best matches the actual distribution of DCT components in each block. The resulting features are the lowest 10th percentile of γ and the mean of γ.

• The frequency variation coefficient ζ,

\zeta = \frac{\sigma_{|X|}}{\mu_{|X|}} = \sqrt{\frac{\Gamma(1/\gamma)\,\Gamma(3/\gamma)}{\Gamma^{2}(2/\gamma)} - 1} \qquad (2.15)

where X is a random variable representing the histogrammed DCT coefficients, σ|X| and µ|X| are the standard deviation and mean of the DCT coefficient magnitudes of the fit to the generalized Gaussian model, Γ is the gamma function given by equation (2.14) and γ is the shape parameter. The feature ζ is computed for all blocks in the image. The ratio ζ has been shown to correlate well with subjective judgement of perceptual quality. The resulting features are the highest 10th percentile of ζ and the mean of ζ.

• The energy sub-band ratio, which is retrieved from the partitions emerging from splitting each block into radial frequency sub-bands. The three sub-bands are represented by a, where a = 1, 2, 3, which correspond to lower, middle and higher spatial radial frequencies respectively. The average energy in sub-band a is defined as its variance, described by

E_a = \sigma_a^{2} \qquad (2.16)

The average energy up to band n is described by

E_{j<a} = \frac{1}{n-1} \sum_{j<a} E_j \qquad (2.17)

The energy values are retrieved by fitting the DCT histogram in each band a to the generalized Gaussian model and then taking the σ²a from the fit. Using the two parameters Ea and Ej<a, a ratio Ra between the difference of the components and the sum of the components is formed according to

R_a = \frac{|E_a - E_{j<a}|}{E_a + E_{j<a}} \qquad (2.18)

This ratio represents the relative distribution of energies in lower and higher bands, which can be affected by distortions. A large ratio value is retrieved when there is a large disparity between the frequency energy of a band and the average energy in the bands of lower frequencies. Since band a = 1 does not have any bands of lower frequency, the ratio is calculated for a = 2, 3 and the mean of the two resulting ratios is the feature used. The feature is computed for all blocks in the image. The resulting features are the highest 10th percentile of Ra and the mean of Ra.

• The orientation model-based feature ζ, which is retrieved from the partitions emerging from splitting each block into oriented sub-regions to capture directional information. ζb is defined according to equation (2.15) from the model histogram fits for each of the three orientations b = 1, 2, 3. The variance of each resulting ζb from all the blocks in an image is calculated. ζb and the variance of ζb are used to capture directional information from images, since image distortions often affect local orientation energy in an unnatural manner. The resulting features are the highest 10th percentile and the mean of the variance of ζ across the three orientations from all the blocks in the image.

The features are extracted, and the feature extraction is repeated after a low-pass filtering and a sub-sampling of the images, meaning that the feature extraction is performed over different scales. The above eight features are extracted on three scales of the images to capture variations in the degree of distortion over different scales. The low-pass filtering and sub-sampling provide coarser scales on which larger distortions can be captured, since the entire image is briefed on fewer values, as if it were a smaller region. The low-pass filtering is done with a symmetric Gaussian filter kernel and the sub-sampling is done by a factor of 2.
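A minimal MATLAB sketch of the block-wise extraction of the first two of these features is given below. It only illustrates the procedure described above: the moment-matching estimate of γ, the grid of candidate shape parameters, the block handling and the percentile pooling are simplifying assumptions and not the exact implementation of [19].

    im   = im2double(rgb2gray(imread('example.jpg')));    % placeholder file name
    blk  = 5;  step = blk - 2;                             % 5 x 5 blocks, two pixel overlap
    g    = 0.3:0.05:10;                                    % candidate shape parameters
    rho  = gamma(2./g).^2 ./ (gamma(1./g).*gamma(3./g));   % theoretical moment ratio of a GGD
    gammaHat = []; zetaHat = [];
    for r = 1:step:size(im,1)-blk+1
        for c = 1:step:size(im,2)-blk+1
            d  = dct2(im(r:r+blk-1, c:c+blk-1));
            x  = d(2:end);                                  % skip the DC coefficient
            m1 = mean(abs(x));  m2 = mean(x.^2);
            [~, k] = min(abs(rho - m1^2/max(m2, eps)));     % best matching shape parameter
            gammaHat(end+1) = g(k);
            zetaHat(end+1)  = std(abs(x)) / max(m1, eps);   % frequency variation coefficient
        end
    end
    featGamma = [prctile(gammaHat, 10), mean(gammaHat)];    % lowest 10th percentile and mean
    featZeta  = [prctile(zetaHat, 90),  mean(zetaHat)];     % 90th percentile as a proxy for the
                                                            %   highest 10th percentile, and mean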

2.6 Features extracted from a convolutional neural network

2.6.1 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has successfully been applied to the field of image classification. The structure roughly mimics the nature of the mammalian visual cortex and neural networks in the brain. It is inspired by the human visual system because of its ability to recognize and localize objects within cluttered scenes. That ability is desired within artificial systems in order to overcome the challenges of recognizing objects in a class despite high in-class variability and perspective variability [4].

Convolutional neural networks are a form of artificial neural networks. The structure of an artificial neural network is shown in figure 2.9.


Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of a value indicating whether the neuron is activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. Simplified, the method can be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, which is shown in figure 2.10.


Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer, the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels which are convolved with each feature map of the input to the convolution layer. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the connected layers, different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results, outperforming many existing methods and clustering with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) used here are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
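A minimal MatConvNet-style sketch of this extraction step is shown below. The model file name, the normalization field names and the choice of reading the 1000 values from the last entry of the network response are assumptions that may differ between MatConvNet versions and model files; it is not the exact code used in the thesis.

    net = load('imagenet-vgg-f.mat');                        % pre-trained VGG-F model file (assumed name)
    im  = single(imread('example.jpg'));                     % placeholder image
    im  = imresize(im, net.meta.normalization.imageSize(1:2));
    im  = im - net.meta.normalization.averageImage;          % mean subtraction (field names assumed)
    res = vl_simplenn(net, im);                              % forward pass through the network
    feat = squeeze(res(end).x);                              % 1000 soft-max values used as the descriptor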


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

\langle (\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n) \rangle, \qquad j = 1, 2, \ldots, n

where αj is the number of coherent pixels and βj is the number of incoherent pixels for color j, and n is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images I and I′ is then given by the following parameters:

\text{differentiating pixels} = \sum_{j=1}^{n} |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \qquad (2.19)

\text{similarity} = 1 - \frac{\text{differentiating pixels}}{\text{all pixels} \cdot 2} \qquad (2.20)

[17]
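A minimal MATLAB sketch of computing such a vector for one image is given below; the helper name, the use of bwconncomp for the contiguous regions and its default 8-connectivity are assumptions made for illustration only.

    function ccv = colorCoherenceVector(rgb, map, tau)
    % ccv(c,:) = [coherent incoherent] pixel counts for color c of the colormap 'map'
        rgb    = imfilter(rgb, fspecial('average', [5 5]));   % local averaging, as in [17]
        idxImg = rgb2ind(rgb, map, 'nodither');                % quantize to indexed colors
        ccv    = zeros(size(map, 1), 2);
        for c = 1:size(map, 1)
            cc    = bwconncomp(idxImg == c - 1);               % contiguous regions of color c
            sizes = cellfun(@numel, cc.PixelIdxList);
            ccv(c, 1) = sum(sizes(sizes >= tau));              % coherent pixels
            ccv(c, 2) = sum(sizes(sizes <  tau));              % incoherent pixels
        end
    end

With two such vectors ccvA and ccvB, equations (2.19)-(2.20) reduce to sim = 1 - sum(abs(ccvA(:) - ccvB(:))) / (2 * numPixels).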

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts, the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient will continue to the final step in the evaluation part. The final step is a retrieval step where one image is selected from a cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.



Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient will be sent to the retrieval part. In the retrieval part a selection will be made from sets of images that are similar so that only one will be retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part and the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pretrained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons for why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality


parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task in order to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The predictor SVM is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20] [19].
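As a minimal illustration of this step (variable names and the default kernel are assumptions, not the exact settings used):

    % Train the quality model on labelled feature vectors, then score new images
    qualityModel = fitcsvm(trainFeatures, trainLabels);         % trainLabels: 'good' / 'bad'
    [predicted, score] = predict(qualityModel, evalFeatures);   % score: [negative positive] per image
    isGood = strcmp(predicted, 'good');                         % images kept by the quality step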

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1 % of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87 % to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
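A sketch of this greedy clustering and retrieval step is shown below, assuming S is a full symmetric matrix of pairwise similarities in percent and score holds the summed classification scores; both names and the simple loop order are illustrative assumptions.

    threshold = 87;
    clusters  = {};
    for i = 1:size(S, 1)
        placed = false;
        for k = 1:numel(clusters)
            if mean(S(i, clusters{k})) > threshold      % average similarity to the cluster
                clusters{k}(end+1) = i;  placed = true;  break
            end
        end
        if ~placed
            clusters{end+1} = i;                        % start a new cluster
        end
    end
    % keep the image with the highest summed classification score from every cluster
    kept = cellfun(@(m) m(find(score(m) == max(score(m)), 1)), clusters);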


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false positive/negative.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (3.1)

which describes how many of the retrieved images actually should be retrieved,

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{all samples}} \qquad (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.
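Expressed directly on logical vectors, where selected marks the images the system keeps and wanted marks the ground truth (both names are illustrative), the measures become:

    tp = sum( selected &  wanted);   fp = sum( selected & ~wanted);
    tn = sum(~selected & ~wanted);   fn = sum(~selected &  wanted);
    precision = tp / (tp + fp);
    recall    = tp / (tp + fn);
    accuracy  = (tp + tn) / numel(wanted);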


(a) Parts of a quantity of images. (b) Precision. (c) Recall. (d) Accuracy.

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images, such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by altering quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt-and-pepper noise,


adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
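A minimal sketch of one such degradation and the SSIM check is given below; the particular alterations, their parameter values and the file name are arbitrary assumptions.

    im   = imread('example.jpg');                          % placeholder original image
    psf  = fspecial('motion', 15, 45);                     % noise/blur group: motion blur
    degr = imfilter(im, psf, 'replicate');
    degr = degr - 40;                                      % light group: darkening (uint8 saturates)
    if ssim(rgb2gray(degr), rgb2gray(im)) > 0.65           % structural similarity to the original
        label = 'good';
    else
        label = 'bad';
    end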


(a) Original image. (b) Brightened and Gaussian blurred. (c) Motion blurred. (d) Darkened and added salt-and-pepper noise.

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subject to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000 but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of its images, and the quality evaluation set consists of the content evaluation set and modified versions of its images. The good class consists of all images in the salient and the non-salient class and the modified versions of them having


an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. It is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.
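A small sketch of how such near-duplicates can be produced is given below; the angle, the mirroring and the plain resize to 500 × 500 (instead of resize plus crop) are simplifying assumptions.

    im = imread('example.jpg');                            % placeholder source image
    v1 = fliplr(im);                                       % mirrored version
    v2 = imrotate(im, 8, 'bilinear', 'crop');              % slightly rotated version
    v1 = imresize(v1, [500 500]);                          % thesis images are resized and cropped to
    v2 = imresize(v2, [500 500]);                          %   500 x 500; a plain resize is used here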

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value equal to the classification accuracy, which is equal to the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average


(a) HOG features. (b) Features extracted from the DCT domain. (c) Features extracted from a CNN.

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. Guessing by placing all images in one class would lead to an accuracy of 50 %, and one of the values for precision or recall to 100 % and the other to 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average


(a) HOG features. (b) Features extracted from the DCT domain. (c) Features extracted from a CNN.

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow, in figure 4.5b. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 4.3a. (b) Color coherence vector of image 4.3b. (c) Color coherence vector of image 4.3c.

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix. (b) Binary similarity matrix showing images that originate from the same image.

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. That is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN. That is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to be able to provide separating features. For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications



in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were designed to describe quality parameters in images, which makes it reasonable that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c, as the different colored curves are similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a, the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many bad images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
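
As an illustration of how such an operating point can be chosen, the sketch below sweeps the decision threshold over the classification scores and reports the resulting rates. It is a minimal example in Python/NumPy and not the MATLAB implementation used in the thesis; the names scores and labels are assumptions, standing for the SVM certainty scores for the good class and the ground-truth labels.

import numpy as np

def roc_points(scores, labels):
    # Sweep the decision threshold from strict to tolerant and record the
    # true positive rate and false positive rate at every setting.
    thresholds = np.unique(scores)[::-1]
    positives = max(np.sum(labels == 1), 1)
    negatives = max(np.sum(labels == 0), 1)
    curve = []
    for t in thresholds:
        predicted_good = scores >= t
        tpr = np.sum(predicted_good & (labels == 1)) / positives
        fpr = np.sum(predicted_good & (labels == 0)) / negatives
        curve.append((t, tpr, fpr))
    return curve

# Pick the strictest threshold that still keeps at least 95% of the good
# images (recall >= 0.95), accepting the false positive rate that follows.
# for t, tpr, fpr in roc_points(scores, labels):
#     if tpr >= 0.95:
#         chosen_threshold = t
#         break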


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results, with an average accuracy of 82.56% compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances of 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general but varies significantly with the different salient objects. This can also be seen in the ROC-curves in figure 4.2, where the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves except airplane are rather close to the line where the true positive rate equals the false positive rate. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more surprising, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. The features extracted from a CNN come from a network trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and often performs well; however, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, since the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while the salient object is still present, does not change the color coherence vector as much as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy; those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the similarity values are dispersed across the spectrum, and the results are therefore very dependent on which threshold value is set. The threshold of 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
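
To make the role of the threshold concrete, the following sketch outlines the similarity computation and the greedy clustering described in chapter 3, written in Python/NumPy rather than the MATLAB used in the thesis. The representation of a color coherence vector as an n_colors x 2 array of coherent and incoherent pixel counts, and the exact order in which images are assigned to clusters, are assumptions.

import numpy as np

def ccv_similarity(ccv_a, ccv_b, n_pixels):
    # Pairwise similarity following equations 2.19-2.20: sum the absolute
    # differences in coherent and incoherent pixel counts over all colors,
    # then normalize by twice the number of pixels in an image.
    differing = np.sum(np.abs(ccv_a - ccv_b))
    return 1.0 - differing / (2.0 * n_pixels)

def cluster_by_similarity(ccvs, n_pixels, threshold=0.87):
    # Place an image in an existing cluster if its average similarity to the
    # images already in that cluster exceeds the threshold, otherwise let it
    # start a new cluster.
    clusters = []
    for index, ccv in enumerate(ccvs):
        for cluster in clusters:
            mean_similarity = np.mean([ccv_similarity(ccv, ccvs[j], n_pixels)
                                       for j in cluster])
            if mean_similarity > threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# With 500 x 500 images, n_pixels = 250000. Lowering the threshold merges
# more images into the same cluster, trading recall for precision.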

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. Both classifications have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps, are probable causes of the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, and that explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classifications and for all of the similarity retrieval results evaluated separately, which is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more source of error when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are counted as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image with, for example, SSIM = 0.99 a lower quality score than an image with SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.
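
The selection rule discussed above can be summarized in a few lines; this is a hedged Python sketch, where quality_scores and content_scores are assumed names standing for the SVM certainty scores for the good and salient classes.

def select_from_cluster(cluster, quality_scores, content_scores):
    # Retrieve the image with the highest summed classification score; any
    # other image in the cluster that is both good and salient would be an
    # equally valid choice, which is the alternative discussed above.
    return max(cluster, key=lambda i: quality_scores[i] + content_scores[i])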


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variation in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context typically prevails during a flight, that information can be used to improve the similarity retrieval part: the color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might also be processed by an automatic target recognition system during the flights when the data is collected, although such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found need to be investigated.
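
A possible form of such a weighting is sketched below in Python/NumPy; it is only an illustration of the idea and not part of the implemented system. The per-color weights (here a hypothetical context_weights array with low values for colors that dominate the expected flight background) and the adjusted normalization are assumptions.

import numpy as np

def weighted_ccv_difference(ccv_a, ccv_b, color_weights):
    # Per-color absolute difference in coherent and incoherent pixel counts,
    # scaled so that colors known to dominate the expected context (for
    # example sky, water or terrain) contribute less to the difference.
    per_color = np.sum(np.abs(ccv_a - ccv_b), axis=1)
    return np.sum(color_weights * per_color)

# context_weights = np.ones(128); context_weights[sky_color_indices] = 0.2
# (sky_color_indices is a hypothetical list of indexed colors to down-weight.)
# A similarity measure can then be formed as before, but the normalization
# constant needs to be adjusted to the chosen weights.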

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
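
As a sketch of what such an approach could look like, the lines below re-target a network pre-trained on a large generic classification task to the two-class salient/non-salient problem. The example uses PyTorch/torchvision rather than the MatConvNet VGG-F model used in the thesis, so the layer indices and argument names are tied to that assumed library and may differ between versions.

import torch
import torchvision.models as models

# Load a network pre-trained on a large generic image classification task.
model = models.vgg16(pretrained=True)  # argument name varies with torchvision version

# Replace the final fully connected layer so the network outputs two classes,
# salient and non-salient, instead of the original 1000 object categories.
model.classifier[6] = torch.nn.Linear(4096, 2)

# Optionally freeze the convolutional layers and fine-tune only the classifier
# part on the labelled salient / non-salient training images.
for parameter in model.features.parameters():
    parameter.requires_grad = False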

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is likely that it varies between data sets as well. The data set used differs much from the data sets for which the system is intended; a data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what results from placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being exactly 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to provide separating features. For HOG, the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results appear in a ROC-curve as following the left and the top borders, and the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.
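To see why a classifier that labels every image as good ends up with recall 1 and with precision and accuracy equal to the proportion of good images, the definitions in equations 3.1-3.3 can be applied to a small made-up example; the counts below are illustrative and not taken from the evaluation sets.

```python
# Illustrative counts for a classifier that labels all 1000 images "good"
# when 695 of them actually are good (roughly the proportions in table 4.1).
tp, fp = 695, 305   # every image is predicted positive
fn, tn = 0, 0       # nothing is predicted negative

precision = tp / (tp + fp)                    # 0.695, equals the proportion of good images
recall = tp / (tp + fn)                       # 1.0
accuracy = (tp + tn) / (tp + fp + fn + tn)    # 0.695, equals the precision here

print(precision, recall, accuracy)
```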

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in the proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
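Changing how tolerant or strict the classification is amounts to sweeping a threshold over the SVM certainty scores and reading off one (false positive rate, true positive rate) point per threshold. The following is a minimal sketch of such a sweep; the score and label arrays are made up for illustration and are not data from the thesis.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """(false positive rate, true positive rate) per threshold; label 1 = good, 0 = bad."""
    points = []
    for t in thresholds:
        predicted_good = scores >= t
        tp = np.sum(predicted_good & (labels == 1))
        fp = np.sum(predicted_good & (labels == 0))
        points.append((fp / np.sum(labels == 0), tp / np.sum(labels == 1)))
    return points

# Made-up certainty scores and ground-truth labels.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 1, 0])
print(roc_points(scores, labels, thresholds=np.linspace(0, 1, 11)))
```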


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general but varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ from each other. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images into the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well; however, this shallow feature extraction method is outperformed by features extracted from a deep architecture.
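The variance figures quoted above are the sample variances of the per-category accuracies expressed in percent. As a check, the value for features extracted from a CNN can be reproduced from the accuracy values in table 4.3; the snippet below is only a verification sketch.

```python
import numpy as np

# Per-category accuracies (in percent) for features extracted from a CNN, from table 4.3:
# cat, airplane, umbrella, handbag, motorbike.
cnn_accuracy = np.array([84.67, 84.67, 82.72, 73.04, 87.72])

# Sample variance (normalised by N - 1) reproduces the quoted 31.55.
print(np.var(cnn_accuracy, ddof=1))
```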

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than by the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.
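Because the color coherence vectors are computed over the entire image, a change of background moves many pixels between color bins and therefore dominates the measure. The sketch below computes the pairwise similarity of equations 2.19-2.20 from two already extracted coherence vectors; the toy vectors are made up and only illustrate the mechanics.

```python
import numpy as np

def ccv_similarity(alpha, beta, alpha_p, beta_p, n_pixels):
    """Similarity between two images given their color coherence vectors,
    where alpha/beta are per-color counts of coherent/incoherent pixels."""
    differentiating = np.sum(np.abs(alpha - alpha_p) + np.abs(beta - beta_p))
    return 1 - differentiating / (n_pixels * 2)

# Toy coherence vectors for a 16-color index on a 500 x 500 image.
n_pixels = 500 * 500
alpha = np.random.multinomial(n_pixels // 2, np.ones(16) / 16)
beta = np.random.multinomial(n_pixels // 2, np.ones(16) / 16)

# Identical vectors give similarity 1.0; a changed background lowers it quickly.
print(ccv_similarity(alpha, beta, alpha.copy(), beta.copy(), n_pixels))
```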

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 0.87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
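A minimal sketch of the clustering rule described in section 3.3 is given below: an image joins a cluster if its average pairwise similarity to the images already in the cluster exceeds the threshold, otherwise it starts a new cluster. The matrix `similarity` is assumed to be precomputed and upper triangular, as in figure 4.5.

```python
import numpy as np

def cluster_by_average_similarity(similarity, threshold=0.87):
    """Greedy clustering on a precomputed, upper-triangular similarity matrix."""
    clusters = []
    for i in range(similarity.shape[0]):
        placed = False
        for cluster in clusters:
            avg = np.mean([similarity[min(i, j), max(i, j)] for j in cluster])
            if avg > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters
```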

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are probable causes for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification and all of those for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part has one additional error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient to be retrieved from each cluster would probably increase the precision and recall values.
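The retrieval rule discussed above, taking from each cluster the image with the highest sum of quality and content classification scores, can be sketched as follows; `clusters` holds image indices and the two score arrays are assumed to come from the classifiers.

```python
import numpy as np

def retrieve_best_per_cluster(clusters, quality_scores, content_scores):
    """Return one image index per cluster: the highest combined certainty score."""
    combined = np.asarray(quality_scores) + np.asarray(content_scores)
    return [max(cluster, key=lambda i: combined[i]) for cluster in clusters]
```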


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such methods it is rarely the case that one method always outperforms the others; instead, performance varies greatly with data sets and tasks. Therefore the biggest remark about the chosen methods concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belongs to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary a lot, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variation in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies a lot, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
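One way to realize the proposed weighting is to scale each color's contribution to the difference in equation 2.19 with a per-color weight that is small for colors expected to dominate the flight context. The sketch below assumes such a hypothetical weight vector; the chosen color indices are illustrative only.

```python
import numpy as np

def weighted_ccv_difference(alpha, beta, alpha_p, beta_p, color_weights):
    """Per-color weighted version of the color coherence vector difference,
    so that colors typical of the expected context contribute less."""
    per_color = np.abs(alpha - alpha_p) + np.abs(beta - beta_p)
    return np.sum(color_weights * per_color)

# Hypothetical weights for a 128-color index: down-weight assumed sky/sea colors.
weights = np.ones(128)
weights[[100, 101, 102]] = 0.1   # indices chosen purely for illustration
```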

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
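As an illustration of that direction, a pre-trained network can be fine-tuned for the two-class salient/non-salient task by replacing its final layer and continuing training on the intended data. The sketch below uses PyTorch and torchvision's VGG16 as a stand-in; it is not the MatConvNet VGG-F setup used in the thesis, and the training loop and data loading are omitted.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from a network pre-trained on a large object recognition task and
# replace the last fully connected layer with a two-class output.
model = models.vgg16(weights="DEFAULT")
model.classifier[6] = nn.Linear(4096, 2)   # salient / non-salient

# Fine-tune all parameters on the intended flight data.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```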

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the results vary between the sets with different salient objects, it is likely that they vary with data sets as well. The data set used differs greatly from the data sets for which the system is intended. A data set containing automatically recorded flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2012. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

44 The entire system

The entire system is put together using the quality classification models retrieved usingfeatures extracted from the DCT domain It is the feature extraction method which pro-vided the best results when investigating the quality classification in section 41 Themodels used for the content classifications are the ones retrieved using features extractedfrom a CNN It is the feature extraction method which provided the best results wheninvestigating the content classification in section 42 The evaluation of the entire systemis done for each of the salient objects The evaluation is performed on the same sets as theevaluation of the quality classification which contains the evaluation sets from the contentclassification and the similarity retrieval The output from the quality classification is in-put to the content classification and the output from the content classification is input tothe similarity retrieval part The results from the similarity retrieval part are the imagesthat are evaluated compared to the images which are wanted The images that are wantedare the ones which are actually good salient unique and best from its cluster There arefewer images that are wanted than images that are not since half of the images are salientand some of them are almost duplicates andor bad There are 342 wanted images out ofthe total 1840 images which makes the proportion of wanted images 01859 The resultsof how the entire system works together is seen in table 45

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision Recall Accuracy Salient object05944 06813 08543 cat06890 05117 08663 airplane05055 06696 08168 umbrella04717 05117 08027 handbag06169 06404 08592 motorbike05755 06029 08399 average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is obtained when all images are placed in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being exactly 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to provide separating features. For HOG, the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results appear in a ROC-curve as following the left and top borders, and the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportions of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c, as the different colored curves are similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a, the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different contexts in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
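As an illustration of how such new rates can be obtained, the sketch below recomputes the true positive rate and false positive rate for a range of thresholds on the score for the good class. The variables scores and isGood are assumed to hold the per-image SVM scores and the ground truth labels; they are not taken from the actual implementation.

    % Illustrative sketch: true/false positive rates for different score thresholds.
    % 'scores' holds the SVM score for the good class and 'isGood' the ground truth.
    thresholds = linspace(min(scores), max(scores), 100);
    tpr = zeros(size(thresholds));
    fpr = zeros(size(thresholds));
    for k = 1:numel(thresholds)
        predictedGood = scores >= thresholds(k);
        tp = sum(predictedGood & isGood);
        fp = sum(predictedGood & ~isGood);
        fn = sum(~predictedGood & isGood);
        tn = sum(~predictedGood & ~isGood);
        tpr(k) = tp / (tp + fn);    % recall, equation 3.2
        fpr(k) = fp / (fp + tn);
    end
    plot(fpr, tpr)   % traces a ROC-style curve; each point corresponds to one threshold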


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances of 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general but varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ from each other. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images into the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The poor performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, all describing statistics in images associated with quality. The decent result for the category airplane, however, is more surprising, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and usually performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than by the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
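To show how directly the threshold enters the clustering, a rough sketch of the procedure described in chapter 3 is given below. S is assumed to be the pairwise similarity matrix in percent, filled for image pairs in processing order; the actual implementation may differ in details.

    % Sketch of the greedy clustering on a pairwise similarity matrix S (in percent).
    % An image joins the first cluster to which its average similarity exceeds the
    % threshold; otherwise it starts a new cluster. Illustrative only.
    threshold = 87;                      % lowering this value merges more images
    clusters = {};
    for i = 1:size(S, 1)
        placed = false;
        for c = 1:numel(clusters)
            members = clusters{c};
            if mean(S(members, i)) > threshold   % average similarity to the cluster
                clusters{c} = [members, i];
                placed = true;
                break
            end
        end
        if ~placed
            clusters{end+1} = i;         % no sufficiently similar cluster found
        end
    end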

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are together a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the salient objects in the content classification and for all of them in the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more source of error when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.
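The effect of the many true negatives can be illustrated with a small worked example following equations 3.1-3.3. The counts below are made up and only chosen to be of the same order as the evaluation sets; they are not the actual confusion counts.

    % Made-up counts of roughly the same order as the evaluation sets (1840 images).
    tp = 200;  fp = 140;  fn = 140;  tn = 1360;
    precision = tp / (tp + fp)                     % = 0.59 (equation 3.1), unaffected by tn
    recall    = tp / (tp + fn)                     % = 0.59 (equation 3.2), unaffected by tn
    accuracy  = (tp + tn) / (tp + fp + fn + tn)    % = 0.85 (equation 3.3), dominated by tn

Doubling tn in this example pushes the accuracy above 0.9 while leaving precision and recall unchanged, which is the pattern seen in table 4.5.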


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such methods it is rarely the case that one method always outperforms the others; instead, performance varies considerably with data sets and tasks. Therefore the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky seen from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
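As a sketch of the weighting idea, the comparison in equations 2.19 and 2.20 could be modified as below. The vectors alpha, beta and alphaP, betaP are assumed to hold the coherent and incoherent pixel counts of the two images being compared, and contextColors is a hypothetical list of color indices expected in the flight context; the weight value and the normalization are only one possible choice.

    % Sketch of a context-weighted color coherence comparison (illustrative only).
    % alpha/beta and alphaP/betaP are column vectors with the coherent and incoherent
    % pixel counts per indexed color for the two images being compared.
    w = ones(numel(alpha), 1);
    w(contextColors) = 0.2;                         % down-weight expected context colors
    diffPixels = sum(w .* (abs(alpha - alphaP) + abs(beta - betaP)));
    allPixels  = sum(w .* (alpha + beta + alphaP + betaP));
    similarity = 1 - diffPixels / allPixels;        % reduces to equation 2.20 when w = 1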

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically collected flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 20: Feature extraction for image selection using machine learning

26 Features extracted from a convolutional neural network 13

for each of the three orientations b = 1 2 3 The variance of each resulting ζbfrom all the blocks in an image is calculated ζb and the variance of ζb are usedto capture directional information from images since image distortions often affectlocal orientation energy in an unnatural manner The resulting features are the 10thhighest percentile and the mean of the variance of ζ across the three orientationsfrom all the blocks in the image

The features are extracted and the feature extraction is repeated after a low-pass filter-ing and a sub-sampling of the images meaning that the feature extraction is performedover different scales The above eight features are extracted on three scales of the imagesto capture variations in the degree of distortion over different scales The low-pass filter-ing and sub-sampling provides coarser scales on which larger distortions can be capturedsince the entire image is briefed on fewer values as if it was a smaller region The low-pass filtering is with a symmetric Gaussian filter kernel and the sub-sampling is done bya factor of 2

26 Features extracted from a convolutional neuralnetwork

261 Convolutional neural networks

Convolutional neural network (CNN) is a machine learning method which has success-fully been applied to the field of image classification The structure roughly mimics thenature of the mammalian visual cortex and neural networks in the brain It is inspired bythe human visual system because of its ability to recognize and localize objects withincluttered scenes That ability is desired within artificial system in order to overcome thechallenges of recognizing objects in a class despite high in-class variability and perspec-tive variability [4]

Convolutional neural networks is a form of artificial neural networks The structureof an artificial neural network is shown in figure 29

14 2 Related theory

Figure 29 The structure of an artificial neural network A simple neural networkwith three layers an input layer one hidden layer and an output layer (Image source[15])

An artificial neural network consists of neurons in multiple layers the input layer theoutput layer and one or more hidden layers Networks with two or more hidden layersare called deep neural networks The input layer consists of an input data and the outputlayer consists of a value indicating whether the neuron is activated or not In the case ofclassification the neurons in the output layer represent the different classes Each of theneurons in the output layer results in a soft-max value which describes the probability ofthe input belonging to that class The input to a neuron is the weighted outputs of theneurons in the previous layer if a layer is fully connected it consists of the output from allneurons in the previous layer The weight controls the amount of influence the output of aneuron has on the next neuron The hidden layers each consists of different combinationsof the weighted outputs of the previous layers That way with increased number of hiddenlayers more complex decisions can be made The method can simplified be described ascomposing complex combinations of the information about the input data which correctlymaps the input data to the correct output In the training part when the network is trainedthose complex combinations are formed which can be thought of as a classification modelIn the evaluation part that model is used to classify new data [15] Convolutional neuralnetworks is a form of artificial neural networks which is applied to images and has aspecial layer structure which is shown in figure 210

26 Features extracted from a convolutional neural network 15

Figure 210 The structure of a convolutional neural network A simple convo-lutional neural network with two convolutional layers each of them followed by asub-sampling layer and finally two fully connected layers (Image source [1])

The hidden layers of a CNN are one or more convolutional layers each followed by apooling layer in succession followed by one or more fully connected layers The convo-lutional layers are feature extraction layers and the last fully connected layer act as theclassifier The convolutional layers in turn consist of two different layers the filter banklayer and the non-linearity layer The inputs and outputs to the convolutional layers arefeature maps represented in a matrix For a 3-color channeled RGB image the dimensionsof that matrix are W times H times 3 where W is the width H is the height and 3 is the numberof feature maps For the first layer the input is the raw image pixel values for each colorchannel The filter bank layers consist of multiple trainable kernels which are convolvedwith the input to the convolution layer with each feature map Each of the kernels detectsa particular feature at every location on the input The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer In the poolinglayers following the convolutional layers sub-sampling occurs The sub-sampling is donefor each feature map and decreases the resolution of the maps After the convolutionallayers the output is passed on to the fully connected layers In the connected layers dif-ferent weighted combinations of the inputs are formed which in the final step results indecisions about which class the image belongs to [9]

262 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and generaltasks have been shown to produce useful results which outperforms many existing meth-ods and clustering with high accuracy when applied to novel data sets It has shown toperform well on new tasks even clustering into categories on which the network was neverexplicitly trained[6] These features extracted from a deep convolutional neural network(CNN) are retrieved from the VGG-F network provided by MatConvNetrsquos archive of opensource implementations of pre-trained models The network contains 5 convolutional lay-ers and 3 fully connected layers The features are extracted from the neuronrsquos activity inthe penultimate layer resulting in 1000 soft-max values The network is trained on a largedata set containing 12 million images used for a 1000 object category classification taskThe features extracted are to be used as descriptors applicable to other data sets [3]

16 2 Related theory

27 Color coherence vector

A color coherence vector consists of a pair of measures for each color describing howmany coherent pixels and how many incoherent pixels there are of that color in the imageA pixel is coherent if it belongs to a contiguous region of the color larger than a presetthreshold value Therefore unlike color histograms which only provide information aboutthe quantity of each color color coherence vectors also provide some spatial informationabout how the colors are distributed in the image A color coherence vector for an imageconsists of

lt (α1 β1) (αn βn) gt j = 1 2 nwhere αj is the number of coherent pixels βj is the number of incoherent pixels for colorj and n is the number of indexed colors

By comparing the color coherence vectors of two images a similarity measure isretrieved The similarity measure between two images I and I prime is then given by thefollowing parameters

differentiating pixels =nsumj=1

|αj minus αprimej | + |βj minus βprimej | (219)

similarity = 1 minus differentiating pixelsall pixels lowast 2

(220)

[17]

3Method

This chapter includes a description of how the different parts of the system are imple-mented A flowchart of how the different parts of the system interrelate is shown in Figure31 The implementation is divided into two parts a training part and an evaluation partFor both parts the first step is feature extraction from the images which is described insection 31 In the training part features are extracted from one content training set con-taining examples of images with salient and non-salient images and one quality trainingset which contains examples of images with good and bad quality The features are sentto the predictor which creates a classification model for each training set one quality clas-sification and one content classification model The predictor is described in section 32In the evaluation part features are extracted from an evaluation set The features are usedto classify the images according to the classification models retrieved in the training partImages that are classified as both good and salient will continue to the final step in theevaluation part The final step is a retrieval step where one image is selected from a clusterof images that are very similar to each other The retrieval step is described in section 33After passing through the three selection steps the images that are left are classified asgood salient and unique which means that they are worthy of further analysis

17

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color

43 Similarity retrieval 33

(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices

34 4 Results

Table 44 Results from the evaluation of the retrieval part for different categories assalient

Precision Recall Accuracy Salient object07782 09421 07806 cat08071 08471 07611 airplane07698 08843 07444 umbrella07537 08471 07111 handbag07935 09050 07778 motorbike07805 08851 07550 average

44 The entire system

The entire system is put together using the quality classification models retrieved usingfeatures extracted from the DCT domain It is the feature extraction method which pro-vided the best results when investigating the quality classification in section 41 Themodels used for the content classifications are the ones retrieved using features extractedfrom a CNN It is the feature extraction method which provided the best results wheninvestigating the content classification in section 42 The evaluation of the entire systemis done for each of the salient objects The evaluation is performed on the same sets as theevaluation of the quality classification which contains the evaluation sets from the contentclassification and the similarity retrieval The output from the quality classification is in-put to the content classification and the output from the content classification is input tothe similarity retrieval part The results from the similarity retrieval part are the imagesthat are evaluated compared to the images which are wanted The images that are wantedare the ones which are actually good salient unique and best from its cluster There arefewer images that are wanted than images that are not since half of the images are salientand some of them are almost duplicates andor bad There are 342 wanted images out ofthe total 1840 images which makes the proportion of wanted images 01859 The resultsof how the entire system works together is seen in table 45

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision Recall Accuracy Salient object05944 06813 08543 cat06890 05117 08663 airplane05055 06696 08168 umbrella04717 05117 08027 handbag06169 06404 08592 motorbike05755 06029 08399 average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images into the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, since the majority of each image consists of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part, and all of them for the similarity retrieval part, evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more source of error when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such output is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
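
As a rough illustration of the weighting idea, the comparison of two color coherence vectors could be modified as in the following MATLAB sketch. The set of context color indices and the weight value are purely hypothetical and would have to be derived from knowledge about the planned flight; alpha1/beta1 and alpha2/beta2 are assumed to be color coherence vectors over 128 indexed colors, and nPixels the number of pixels in one image.

% Hypothetical example: down-weight colors assumed to dominate the context
contextColors = [1:20 100:128];     % hypothetical indices of context colors (e.g. sky and sea tones)
w = ones(1, 128);
w(contextColors) = 0.2;             % lower weight for context colors

diffPixels = sum(w .* (abs(alpha1 - alpha2) + abs(beta1 - beta2)));
similarity = 1 - diffPixels / (nPixels * 2);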

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That forebodes that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the results vary between the sets with different salient objects, it is likely that they vary with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.



Figure 2.9: The structure of an artificial neural network. A simple neural network with three layers: an input layer, one hidden layer and an output layer. (Image source: [15])

An artificial neural network consists of neurons in multiple layers: the input layer, the output layer and one or more hidden layers. Networks with two or more hidden layers are called deep neural networks. The input layer consists of the input data and the output layer consists of values indicating whether the neurons are activated or not. In the case of classification, the neurons in the output layer represent the different classes. Each of the neurons in the output layer results in a soft-max value which describes the probability of the input belonging to that class. The input to a neuron is the weighted outputs of the neurons in the previous layer; if a layer is fully connected, it consists of the output from all neurons in the previous layer. The weight controls the amount of influence the output of a neuron has on the next neuron. The hidden layers each consist of different combinations of the weighted outputs of the previous layers. That way, with an increased number of hidden layers, more complex decisions can be made. The method can, simplified, be described as composing complex combinations of the information about the input data which correctly map the input data to the correct output. In the training part, when the network is trained, those complex combinations are formed, which can be thought of as a classification model. In the evaluation part, that model is used to classify new data [15]. Convolutional neural networks are a form of artificial neural networks which is applied to images and has a special layer structure, which is shown in figure 2.10.


Figure 2.10: The structure of a convolutional neural network. A simple convolutional neural network with two convolutional layers, each of them followed by a sub-sampling layer, and finally two fully connected layers. (Image source: [1])

The hidden layers of a CNN are one or more convolutional layers, each followed by a pooling layer, in succession followed by one or more fully connected layers. The convolutional layers are feature extraction layers and the last fully connected layer acts as the classifier. The convolutional layers in turn consist of two different layers: the filter bank layer and the non-linearity layer. The inputs and outputs to the convolutional layers are feature maps represented in a matrix. For a 3-color channeled RGB image the dimensions of that matrix are W × H × 3, where W is the width, H is the height and 3 is the number of feature maps. For the first layer the input is the raw image pixel values for each color channel. The filter bank layers consist of multiple trainable kernels which are convolved with the input to the convolution layer, with each feature map. Each of the kernels detects a particular feature at every location on the input. The non-linearity layer applies a non-linear sigmoid activation function to the output from the filter bank layer. In the pooling layers following the convolutional layers, sub-sampling occurs. The sub-sampling is done for each feature map and decreases the resolution of the maps. After the convolutional layers the output is passed on to the fully connected layers. In the fully connected layers different weighted combinations of the inputs are formed, which in the final step results in decisions about which class the image belongs to [9].
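
For illustration, a small network with this layer structure could be defined as below using the layer objects in MATLAB's Neural Network Toolbox. The input size and the number of filters are arbitrary examples, and a ReLU non-linearity is used in place of the sigmoid mentioned above; this is only a sketch and not the network used in this thesis.

layers = [
    imageInputLayer([64 64 3])           % input feature maps, W x H x 3
    convolution2dLayer(5, 16)            % filter bank layer with 16 kernels
    reluLayer                            % non-linearity layer
    maxPooling2dLayer(2, 'Stride', 2)    % pooling / sub-sampling layer
    convolution2dLayer(5, 32)
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(2)               % fully connected layer, two classes
    softmaxLayer                         % soft-max values for each class
    classificationLayer];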

2.6.2 Extracting features from a pre-trained network

Using features extracted from pre-trained neural networks trained on large and general tasks has been shown to produce useful results, outperforming many existing methods and clustering with high accuracy when applied to novel data sets. It has been shown to perform well on new tasks, even clustering into categories on which the network was never explicitly trained [6]. The features extracted from a deep convolutional neural network (CNN) in this work are retrieved from the VGG-F network provided by MatConvNet's archive of open source implementations of pre-trained models. The network contains 5 convolutional layers and 3 fully connected layers. The features are extracted from the neurons' activity in the penultimate layer, resulting in 1000 soft-max values. The network is trained on a large data set containing 1.2 million images used for a 1000 object category classification task. The features extracted are to be used as descriptors applicable to other data sets [3].
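
As an illustration of how such descriptors can be obtained, the following MATLAB sketch uses MatConvNet and the pre-trained VGG-F model. The model file name and the meta-data field names follow the MatConvNet releases and may differ between versions, so the snippet is an assumption-laden outline rather than the exact implementation used in this work.

net = load('imagenet-vgg-f.mat');                  % pre-trained VGG-F model from the MatConvNet archive
im  = imread('example.jpg');                       % hypothetical input image

im_ = single(im);                                  % convert and normalize to the network's input format
im_ = imresize(im_, net.meta.normalization.imageSize(1:2));
im_ = bsxfun(@minus, im_, net.meta.normalization.averageImage);

res = vl_simplenn(net, im_);                       % forward pass through the network
features = squeeze(res(end).x);                    % 1000 soft-max values used as the feature descriptor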


2.7 Color coherence vector

A color coherence vector consists of a pair of measures for each color, describing how many coherent pixels and how many incoherent pixels there are of that color in the image. A pixel is coherent if it belongs to a contiguous region of the color larger than a preset threshold value. Therefore, unlike color histograms, which only provide information about the quantity of each color, color coherence vectors also provide some spatial information about how the colors are distributed in the image. A color coherence vector for an image consists of

$\langle (\alpha_1, \beta_1), \dots, (\alpha_n, \beta_n) \rangle, \quad j = 1, 2, \dots, n$

where $\alpha_j$ is the number of coherent pixels and $\beta_j$ is the number of incoherent pixels for color $j$, and $n$ is the number of indexed colors.

By comparing the color coherence vectors of two images, a similarity measure is retrieved. The similarity measure between two images $I$ and $I'$ is then given by the following parameters:

$\text{differentiating pixels} = \sum_{j=1}^{n} \left( |\alpha_j - \alpha'_j| + |\beta_j - \beta'_j| \right)$    (2.19)

$\text{similarity} = 1 - \dfrac{\text{differentiating pixels}}{\text{all pixels} \cdot 2}$    (2.20)

[17]

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient continue to the final step in the evaluation part. The final step is a retrieval step where one image is selected from each cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.

Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient are sent to the retrieval part. In the retrieval part a selection is made from sets of images that are similar, so that only one is retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification, which are compared against each other. The best feature extraction method for each of the two classifications is used for that part when the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because the features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task to be able to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.
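
For the two shallow methods, the extraction can be sketched as below with MATLAB's built-in functions [12], [13]; the HOG cell size is only a placeholder, and the DCT line shows the transform on which the quality-related statistics are computed, not the statistics themselves.

img = rgb2gray(imread('example.jpg'));                     % hypothetical input image

hogFeatures = extractHOGFeatures(img, 'CellSize', [8 8]);  % histogram of oriented gradients

dctCoefficients = dct2(im2double(img));                    % 2-D DCT, the basis for the DCT-domain features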

3.2 Predictor

The predictor used is an SVM, as described in chapter 2, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The SVM predictor is chosen because of its advantages, one of them being not having the problem of over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].
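
A minimal sketch of the predictor step is given below; trainFeatures, trainLabels and evalFeatures are illustrative variable names for the extracted feature matrices and their labels, not the exact names used in the implementation.

model = fitcsvm(trainFeatures, trainLabels);        % train a binary SVM classification model

[labels, scores] = predict(model, evalFeatures);    % predicted class labels and certainty scores,
                                                    % one score column per class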

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1 % of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.
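
The computation described above can be outlined in MATLAB as follows; the function names are illustrative and the snippet is a sketch of the procedure rather than the exact implementation used here.

function s = ccvSimilarity(im1, im2, tau)
% Pairwise similarity from color coherence vectors (equations 2.19-2.20);
% tau is the coherence threshold in pixels, e.g. 2500.
    [a1, b1] = ccv(im1, tau);
    [a2, b2] = ccv(im2, tau);
    diffPixels = sum(abs(a1 - a2) + abs(b1 - b2));
    nPixels = size(im1, 1) * size(im1, 2);
    s = 1 - diffPixels / (nPixels * 2);
end

function [alpha, beta] = ccv(rgbImage, tau)
% Color coherence vector: 5x5 averaging filter, 128 colors from the jet colormap.
    nColors  = 128;
    smoothed = imfilter(rgbImage, fspecial('average', [5 5]));
    idx      = rgb2ind(smoothed, jet(nColors), 'nodither');
    alpha = zeros(1, nColors);
    beta  = zeros(1, nColors);
    for c = 1:nColors
        cc    = bwconncomp(idx == c - 1);              % contiguous regions of color c
        sizes = cellfun(@numel, cc.PixelIdxList);
        alpha(c) = sum(sizes(sizes >  tau));           % coherent pixels
        beta(c)  = sum(sizes(sizes <= tau));           % incoherent pixels
    end
end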

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87 % to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, and that is the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
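
The clustering can be outlined as in the sketch below, where S is assumed to be a full symmetric matrix of pairwise similarities and the threshold 0.87 corresponds to the 87 % used above; the greedy strategy shown is one possible reading of the description, not necessarily the exact implementation.

threshold = 0.87;
clusters  = {};                               % each cell holds the image indices of one cluster
for i = 1:size(S, 1)
    placed = false;
    for k = 1:numel(clusters)
        if mean(S(i, clusters{k})) > threshold
            clusters{k}(end + 1) = i;         % average similarity high enough: join the cluster
            placed = true;
            break
        end
    end
    if ~placed
        clusters{end + 1} = i;                % otherwise start a new cluster
    end
end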


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

$\text{Precision} = \dfrac{\text{true positives}}{\text{true positives} + \text{false positives}}$    (3.1)

which describes how many of the retrieved images should have been retrieved,

$\text{Recall} = \dfrac{\text{true positives}}{\text{true positives} + \text{false negatives}}$    (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

$\text{Accuracy} = \dfrac{\text{true positives} + \text{true negatives}}{\text{all samples}}$    (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images  (b) Precision  (c) Recall  (d) Accuracy

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are denoted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a), and how they define the measures is illustrated in (b), (c) and (d).
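
Given logical vectors of predicted and true labels, the three measures can be computed as in the short sketch below; the variable names are illustrative.

tp = sum( predicted &  actual);               % true positives
fp = sum( predicted & ~actual);               % false positives
fn = sum(~predicted &  actual);               % false negatives
tn = sum(~predicted & ~actual);               % true negatives

precision = tp / (tp + fp);                   % equation 3.1
recall    = tp / (tp + fn);                   % equation 3.2
accuracy  = (tp + tn) / (tp + fp + fn + tn);  % equation 3.3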

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
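
A simplified MATLAB sketch of the degradation step is given below. The alteration parameters are arbitrary examples, only one alteration from each group is shown, and the SSIM comparison is done on grayscale versions of the images for simplicity; this is not the exact generation procedure used.

original = imread('example.jpg');                          % hypothetical image assumed to be good
degraded = original;

switch randi(3)                                            % one alteration from the noise/blur group
    case 1
        degraded = imnoise(degraded, 'salt & pepper', 0.05);
    case 2
        degraded = imgaussfilt(degraded, 2);               % Gaussian blur
    case 3
        degraded = imfilter(degraded, fspecial('motion', 15, 45));  % motion blur
end
if rand < 0.5                                              % one alteration from the light group
    degraded = degraded + uint8(40);                       % brighten
else
    degraded = degraded - uint8(40);                       % darken
end

quality = ssim(rgb2gray(degraded), rgb2gray(original));    % structural similarity index [21]
isGood  = quality > 0.65;                                  % good if more than 65 % of structures preserved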


(a) Original image  (b) Brightened and Gaussian blurred  (c) Motion blurred  (d) Darkened and added salt and pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class and the modified versions of them that have an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image and thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.
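
The creation of near-duplicates can be sketched as below; the particular transformations and their parameters are examples only, and not necessarily the ones used when generating the data.

original = imread('example.jpg');                          % hypothetical image
versions = {original, ...
            fliplr(original), ...                          % mirroring
            imrotate(original, 5, 'bilinear', 'crop'), ... % small rotation
            imtranslate(original, [20 0])};                % small translation

for k = 1:numel(versions)
    im   = versions{k};
    side = min(size(im, 1), size(im, 2));
    im   = imcrop(im, [1 1 side - 1 side - 1]);            % square crop
    versions{k} = imresize(im, [500 500]);                 % final size 500 x 500 pixels
end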

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value equal to the classification accuracy, which is equal to the proportion of good images. If the difference in number of images between the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.
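
For reference, an ROC curve of this kind can be computed from the classifier output with MATLAB's perfcurve, as sketched below; labels and scores are assumed to come from the predict call in section 3.2, and which score column corresponds to the positive class depends on the order of the class names in the trained model.

[fpr, tpr] = perfcurve(labels, scores(:, 2), 'good');      % 'good' assumed to be the positive class
plot(fpr, tpr)
xlabel('false positive rate')
ylabel('true positive rate (recall)')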

Table 4.1: The proportion of good images for the different salient objects.

Proportion of good images  Salient object
0.6951                     cat
0.7288                     airplane
0.6935                     umbrella
0.6821                     handbag
0.6902                     motorbike


Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method      Precision  Recall  Accuracy  Salient object
HOG                            0.8399     0.939   0.8332    cat
HOG                            0.8544     0.9799  0.8636    airplane
HOG                            0.8018     0.9702  0.813     umbrella
HOG                            0.8333     0.9442  0.8332    handbag
HOG                            0.8506     0.9236  0.8353    motorbike
HOG                            0.8360     0.9514  0.8357    average
Extracted from the DCT domain  0.9196     0.9116  0.8832    cat
Extracted from the DCT domain  0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain  0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain  0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain  0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain  0.9298     0.9347  0.9054    average
Features extracted from a CNN  0.6951     1       0.6951    cat
Features extracted from a CNN  0.7288     1       0.7288    airplane
Features extracted from a CNN  0.6935     1       0.6935    umbrella
Features extracted from a CNN  0.6821     1       0.6821    handbag
Features extracted from a CNN  0.6902     1       0.6902    motorbike
Features extracted from a CNN  0.6979     1       0.6979    average


(a) HOG features  (b) Features extracted from the DCT domain  (c) Features extracted from a CNN

Figure 4.1: ROC curves for the quality classifications. The curves show the relation between the true positive rate (recall) and the false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation. 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. Guessing by placing all images in one class would lead to an accuracy of 50 %, and to one of the values for precision and recall being 100 % and the other 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC curves, shown in figure 4.2. The ROC curves show the relation between the true positive rate (recall) and the false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method      Precision  Recall  Accuracy  Salient object
HOG                            0.6631     0.6717  0.6652    cat
HOG                            0.8645     0.8043  0.8391    airplane
HOG                            0.5959     0.5739  0.5924    umbrella
HOG                            0.6759     0.6348  0.6652    handbag
HOG                            0.5758     0.7348  0.5967    motorbike
HOG                            0.6750     0.6839  0.6717    average
Extracted from the DCT domain  0.6253     0.6239  0.6250    cat
Extracted from the DCT domain  0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain  0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain  0.6256     0.5630  0.613     handbag
Extracted from the DCT domain  0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain  0.6559     0.6370  0.6441    average
Features extracted from a CNN  0.9038     0.7761  0.8467    cat
Features extracted from a CNN  1          0.6935  0.8467    airplane
Features extracted from a CNN  0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN  0.7560     0.6804  0.7304    handbag
Features extracted from a CNN  0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN  0.8799     0.7635  0.8256    average


(a) HOG features  (b) Features extracted from the DCT domain  (c) Features extracted from a CNN

Figure 4.2: ROC curves for the content classifications. The curves show the relation between the true positive rate (recall) and the false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown, in figure 4.5b, is a binary similarity matrix showing the true clusters as yellow. The results from the retrieval part are shown in table 4.4.


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 4.3a  (b) Color coherence vector of image 4.3b  (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis is the indexed colors and the y-axis is the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix  (b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measure between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classification are the ones retrieved using features extracted from a CNN, the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salientobjects it is at most 309 percentage units from the mean value The variation in accuracyvalues for the different sets of salient objects overall matches the variation in proportionin good images meaning that the salient objects with slightly higher proportion of goodimages also have slightly higher accuracy Therefore it is possible to interpret the resultsfrom the quality classification as being general and not varying remarkable with the dif-ferent salient objects This can be seen in the ROC-curves in figure 41b and 41c as thedifferent colored curves being similar the difference in proportion of good between thedifferent salient objects however causes slight variations In the ROC-curve for HOG fea-tures in figure 41a the curves are not very similar which is partly because the differentproportions of good images but mostly because it does not provide a good quality classi-fication model HOG provides a poor classification model from which the results variesbetween the different salient objects

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41

51 Results 37

512 Content classification

The evaluation of the content classification shows that features extracted from a CNN givesthe best results Features extracted from a CNN gives an average accuracy of 8256 com-pared to 6717 for HOG and 6441 for features extracted from the DCT domain Theaccuracy values have variances 3155 for features extracted from a CNN 10005 forHOG and 6571 for features extracted from the DCT domain Those numbers are allquite high and implies that the content classification is not general and varies significantlywith the different salient objects That can also be seen in the ROC-curves in figure 42as the different colored curves representing different salient objects are differing Figure42b which shows the results from using features extracted from the DCT domain showsthat the curves for the different salient objects are quite similar except for the categoryairplane All curves are rather close to the line where the true positive rate equals thefalse positive rate except for airplane Being close to that line for this case where each ofthe two classes contain half of the images corresponds to simply classifying all imagesin the same class That means that the category airplane is the only one for which a de-cent classification model is retrieved The bad performance of features extracted from theDCT domain for content classification for the majority of the different salient objects isnot astonishing since it uses very few features describing statistics in images associatedwith quality The decent result for the category airplane however is more astonishingsince it is able to differ somewhat between salient and non-salient images only describedby smoothness texture and edge information Features extracted from a CNN are trainedon a large set of images for an object classification task The task is similar to this con-tent classification and the features seem to fulfill their purpose of performing well whenapplied to new data sets HOG are often used for content classification tasks and perform-ing well However this shallow feature extraction method is outperformed by featuresextracted from a deep architecture

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of each image consists of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause as big a change in the color coherence vector as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 0.87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
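For reference, the clustering rule that this threshold enters into can be sketched as follows. This is only an illustrative sketch of the rule described in the method chapter, assuming S is a symmetric matrix of the pairwise similarity measures; the variable names are chosen freely.

% Minimal sketch of the threshold-based clustering of the similarity matrix S.
threshold = 0.87;             % the similarity threshold discussed above
clusters = {};                % each cell holds the indices of one cluster
for i = 1:size(S, 1)
    placed = false;
    for c = 1:numel(clusters)
        if mean(S(i, clusters{c})) > threshold   % average similarity to the cluster
            clusters{c}(end+1) = i;
            placed = true;
            break
        end
    end
    if ~placed
        clusters{end+1} = i;  % no cluster is similar enough; start a new one
    end
end

Lowering the threshold merges more images into existing clusters, which trades recall for precision, as discussed above.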

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are probable causes of the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall. This explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification and all of those for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part has an additional error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough with SSIM to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient from each cluster would probably increase the precision and recall values.
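As an illustration of this selection step, the per-cluster retrieval can be sketched as below, reusing the clusters variable from the sketch in section 5.1.3; qualityScore and contentScore are assumed vectors holding the certainty scores from the two classifications (illustrative names, not the actual implementation).

% Minimal sketch of retrieving the highest-scoring image from each cluster.
retrieved = zeros(1, numel(clusters));
for c = 1:numel(clusters)
    members = clusters{c};
    [~, best] = max(qualityScore(members) + contentScore(members));
    retrieved(c) = members(best);
end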


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but it is rarely the case that one method always outperforms the others; instead the performance varies greatly with data sets and tasks. Therefore the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes, in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary greatly, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies greatly and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during the flights when the data is collected, but such results are not available for this study. Taking advantage of the results from such a system, the positions of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
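As an illustration of the proposed weighting, the color coherence difference could be computed as sketched below; ccv1, ccv2, nPixels and contextColors are assumptions made for the example, and the weight value is arbitrary.

% Minimal sketch of a context-weighted color coherence difference.
% ccv1 and ccv2 are assumed to be n-by-2 matrices [alpha beta] for two images,
% nPixels is the number of pixels per image and contextColors is a hypothetical
% list of color indices that dominate the expected flight context.
w = ones(size(ccv1, 1), 1);
w(contextColors) = 0.2;                                    % down-weight typical context colors
diffPerColor = abs(ccv1(:,1) - ccv2(:,1)) + abs(ccv1(:,2) - ccv2(:,2));
similarity = 1 - sum(w .* diffPerColor) / (2 * nPixels);   % weighted version of the similarity measure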

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. This suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is very likely that it varies between data sets as well. The data set used differs greatly from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B. H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R. L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.


5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salientobjects it is at most 309 percentage units from the mean value The variation in accuracyvalues for the different sets of salient objects overall matches the variation in proportionin good images meaning that the salient objects with slightly higher proportion of goodimages also have slightly higher accuracy Therefore it is possible to interpret the resultsfrom the quality classification as being general and not varying remarkable with the dif-ferent salient objects This can be seen in the ROC-curves in figure 41b and 41c as thedifferent colored curves being similar the difference in proportion of good between thedifferent salient objects however causes slight variations In the ROC-curve for HOG fea-tures in figure 41a the curves are not very similar which is partly because the differentproportions of good images but mostly because it does not provide a good quality classi-fication model HOG provides a poor classification model from which the results variesbetween the different salient objects

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41

51 Results 37

512 Content classification

The evaluation of the content classification shows that features extracted from a CNN givesthe best results Features extracted from a CNN gives an average accuracy of 8256 com-pared to 6717 for HOG and 6441 for features extracted from the DCT domain Theaccuracy values have variances 3155 for features extracted from a CNN 10005 forHOG and 6571 for features extracted from the DCT domain Those numbers are allquite high and implies that the content classification is not general and varies significantlywith the different salient objects That can also be seen in the ROC-curves in figure 42as the different colored curves representing different salient objects are differing Figure42b which shows the results from using features extracted from the DCT domain showsthat the curves for the different salient objects are quite similar except for the categoryairplane All curves are rather close to the line where the true positive rate equals thefalse positive rate except for airplane Being close to that line for this case where each ofthe two classes contain half of the images corresponds to simply classifying all imagesin the same class That means that the category airplane is the only one for which a de-cent classification model is retrieved The bad performance of features extracted from theDCT domain for content classification for the majority of the different salient objects isnot astonishing since it uses very few features describing statistics in images associatedwith quality The decent result for the category airplane however is more astonishingsince it is able to differ somewhat between salient and non-salient images only describedby smoothness texture and edge information Features extracted from a CNN are trainedon a large set of images for an object classification task The task is similar to this con-tent classification and the features seem to fulfill their purpose of performing well whenapplied to new data sets HOG are often used for content classification tasks and perform-ing well However this shallow feature extraction method is outperformed by featuresextracted from a deep architecture

The number of salient and non-salient training images is approximately 2000 for eachsalient object but it varies slightly The largest salient class consists of 2418 images andthe smallest 1700 Although the number of training observations for each salient objectis quite large the variation may impact the capacity of the resulting content classificationmodels The variations in the content classification results is however more likely causedby the different content in the images

As described for the quality classification in section 511 if one type of error is pre-ferred over the other In this case where one class is retained and the other is not it mightbe more important not to discard too many salient images than to discard all non-salientimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 42

513 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 7550 with the best result being7806 and the worst 7111 The result varies with a few percentage points betweenthe different salient objects and the variance in accuracy is 813 That is most likelycaused by the context of the salient objects rather than the objects themselves That isbecause majority of the images consists of mostly context and the color coherence vectors

38 5 Discussion

are calculated over the entire images Applying a transformation to an image with ahomogeneous background still having the salient object present does not cause a changein the color coherence vector as big as it would be if the background were changing Thismight explain why the two sets with the lowest resulting accuracy have the salient objectshandbag and umbrella which are typically found in varying contexts such as crowds ofpeople The sets with the salient objects cat motorbike and airplane has the best resultingaccuracy Those salient objects are often found in relatively homogeneous context suchas indoor environment roads and sky

The similarity threshold was chosen from testing because it gave the best resultingaccuracy on average for the different salient objects As shown in the resulting similaritymatrix for the sub-set of the category cat in figure 45 the resulting similarity valuesare dispersed across the spectrum Therefore the results are very dependent on whichthreshold value is set The value 87 is quite high which is why the recall value is in everycase higher than the precision value In this case where almost-duplicates are removedthat means rather keeping a few similar images than risking the removal of unique images

514 The entire system

The evaluation of the entire system gives an average accuracy of 8399 with the bestresult being 8663 and the worst 8027 The result varies with a few percentage pointsbetween the different salient objects and the variance in accuracy is 799 The classi-fications both have overall high precision values which means that they do not falselyclassify many images as good or salient That and the proportion of wanted images be-ing only 01859 together with the fact that most of the images should be removed duringthe classification steps is a probable cause for the high number of true negatives For allsets most of the correct classifications are true negatives which as shown in equations31-33 affects the accuracy but not the precision and recall which explains why the accu-racy is severely higher than the precision and recall The accuracy values are also higherthan the accuracy values for some of the content classification part and all for the similar-ity retrieval part separately That is also most likely caused by the high number of truenegatives when evaluating the entire system The variance in accuracy being lower forthe entire system than for the separate parts is probably another consequence of the highnumber of true negatives One cause for the overall low precision and recall is that in thesimilarity retrieval part there is one more error cause when the system is put together Theimage that is retrieved from each cluster is the one with the highest score from the classifi-cations All images in a cluster are thought to be equally salient since they all contain thesalient object The quality of the images are decided based on the SSIM values and sinceunmodified images have SSIM =1 only unmodified images retrieved are correct In manycases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 andis therefore counted as falsely classified Although the quality classification scores leadto good classification result they might not correlate well enough to give an image of forexample SSIM =099 lower quality score than an image of SSIM =1 Accepting any imagebeing both good and salient being retrieved from each cluster would probably increasethe precision and recall values

5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the chosen methods concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
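A sketch of the suggested context weighting is given below. It modifies the comparison of two color coherence vectors from equations 2.19 and 2.20 so that colors expected to dominate the background of a planned flight contribute less; the weight value and the set of down-weighted colors are assumptions.

% Minimal sketch of a context-weighted color coherence similarity. contextColors
% is a list of color indices expected to dominate the background (for example sky
% and water tones for flight data); nPixels is the number of pixels per image.
function s = weightedCcvSimilarity(alpha1, beta1, alpha2, beta2, contextColors, nPixels)
    w = ones(1, numel(alpha1));
    w(contextColors) = 0.2;              % hypothetical down-weighting of context colors

    diffPixels = sum(w .* (abs(alpha1 - alpha2) + abs(beta1 - beta2)));
    s = 1 - diffPixels / (2 * nPixels);  % analogous to equation 2.20
end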

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
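A possible next step along that line is to fine-tune a pretrained network on the salient/non-salient task rather than only using it as a fixed feature extractor. The sketch below uses MATLAB's pretrained alexnet as a stand-in for the network used in this work; the folder layout, input size and training options are assumptions.

% Minimal fine-tuning sketch, assuming the Neural Network Toolbox and the
% alexnet support package. The final layers are replaced to output two classes.
net = alexnet;                                   % pretrained on ImageNet
layers = net.Layers;
layers(end-2) = fullyConnectedLayer(2);          % two classes: salient / non-salient
layers(end)   = classificationLayer;

imds = imageDatastore('trainingContent', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([227 227], imds);   % resize to the network input size

options = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...                % low rate to preserve pretrained weights
    'MaxEpochs', 5, ...
    'MiniBatchSize', 32);

finetunedNet = trainNetwork(augimds, layers, options);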

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it most likely varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.

Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

                    • 41 Quality classification
                    • 42 Content classification
                    • 43 Similarity retrieval
                    • 44 The entire system
                      • 5 Discussion
                        • 51 Results
                          • 511 Quality classification
                          • 512 Content classification
                          • 513 Similarity retrieval part
                          • 514 The entire system
                            • 52 Method
                            • 53 Possible improvements
                              • 6 Conclusions
                              • Bibliography
Page 24: Feature extraction for image selection using machine learning

3 Method

This chapter includes a description of how the different parts of the system are implemented. A flowchart of how the different parts of the system interrelate is shown in Figure 3.1. The implementation is divided into two parts: a training part and an evaluation part. For both parts the first step is feature extraction from the images, which is described in section 3.1. In the training part, features are extracted from one content training set, containing examples of salient and non-salient images, and one quality training set, which contains examples of images with good and bad quality. The features are sent to the predictor, which creates a classification model for each training set: one quality classification model and one content classification model. The predictor is described in section 3.2. In the evaluation part, features are extracted from an evaluation set. The features are used to classify the images according to the classification models retrieved in the training part. Images that are classified as both good and salient continue to the final step in the evaluation part. The final step is a retrieval step, where one image is selected from each cluster of images that are very similar to each other. The retrieval step is described in section 3.3. After passing through the three selection steps, the images that are left are classified as good, salient and unique, which means that they are worthy of further analysis.


Figure 3.1: Flow chart of the implementation. The system is trained on two different input sets, which leads to two classification models: one for quality and one for content. The evaluation set is classified using the two models; the images that are classified as both good and salient are sent to the retrieval part. In the retrieval part a selection is made from sets of images that are similar, so that only one is retrieved. The resulting images are good, salient and unique, which means that they are worthy of further analysis.

3.1 Feature extraction

Three different methods of feature extraction are performed, which leads to three different results for each classification; these are compared against each other. The best feature extraction method for each of the two classifications is used for that part when the entire system is put together. The methods that are used are the following: histogram of oriented gradients (HOG) [20], features extracted from the discrete cosine transform (DCT) domain [21] and features extracted from a pre-trained convolutional neural network (CNN) [3]. The feature extraction methods have different advantages, which are the reasons why they are chosen. HOG is often used for object detection; it uses gradients to describe images. Since gradients provide information about edges and corners in an image, HOG is favorable when describing content in an image. The method of extracting features from the DCT domain, on the other hand, is chosen because its features are produced to describe quality parameters in an image. The last method uses features extracted from a CNN, where the network is trained on a large set of images in an object recognition task in order to generalize to other tasks and data sets for which the network has not been trained. The method is chosen because of its ability to perform well on generic tasks.
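To make the feature extraction alternatives concrete, a minimal Python sketch is given below. It is illustrative only: the thesis uses the MATLAB implementations [12], [13], and the DCT-domain quality features of [19] are considerably more elaborate than the simple block statistics computed here; the file name and parameter values are assumptions.

```python
import numpy as np
from scipy.fft import dctn
from skimage import color, io
from skimage.feature import hog

def hog_features(image_rgb):
    # Histogram of oriented gradients computed on the grayscale image.
    gray = color.rgb2gray(image_rgb)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def dct_block_statistics(image_rgb, block=8):
    # Simplified stand-in for DCT-domain quality features: mean and standard
    # deviation of the AC-coefficient energy over all 8x8 blocks. The features
    # in [19] model the coefficient distributions in far more detail.
    gray = color.rgb2gray(image_rgb)
    h, w = gray.shape
    ac_energies = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            coeffs = dctn(gray[r:r + block, c:c + block], norm="ortho")
            coeffs[0, 0] = 0.0                      # drop the DC component
            ac_energies.append(np.abs(coeffs).sum())
    ac_energies = np.array(ac_energies)
    return np.array([ac_energies.mean(), ac_energies.std()])

image = io.imread("example.jpg")   # hypothetical 500x500 RGB image
x_hog = hog_features(image)        # one HOG feature vector per image
x_dct = dct_block_statistics(image)
```

The two feature vectors would be used separately, one per classification experiment, rather than combined.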

3.2 Predictor

The predictor used is an SVM, as described in section 2.3, using the MATLAB implementation [11]. The model is trained on labelled examples of images of good and bad quality to retrieve a quality classification model. Another SVM model is trained on labelled examples of salient and non-salient images to retrieve a content classification model. When using a model to classify new data, the resulting output for each image is a class label and a certainty score matrix. The score matrix contains the scores for each image being classified in the negative class and the positive class, respectively. The SVM is chosen as predictor because of its advantages, one of them being that it is less prone to over-fitting. Over-fitting occurs when a model has too many features relative to the number of observations, and results in poor predictive performance. The problem of over-fitting is relevant to take into account when working with machine learning on images, because the number of features extracted from an image is often very large [16]. SVM has previously been used in many image classification tasks with good results [20], [19].
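As an illustration of the training and prediction steps, the sketch below uses scikit-learn's SVM rather than the MATLAB implementation [11] used in the thesis; the feature matrices and labels are random placeholders, and the decision scores play the role of the certainty scores described above.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder features and labels; in the thesis these come from the
# feature extraction in section 3.1 and the labelled training sets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))      # one feature vector per image
y_train = rng.integers(0, 2, size=200)    # 1 = good (or salient), 0 = bad (or non-salient)
X_eval = rng.normal(size=(50, 64))

model = SVC(kernel="linear")              # one model per classification task
model.fit(X_train, y_train)

labels = model.predict(X_eval)            # class label for each evaluation image
scores = model.decision_function(X_eval)  # signed certainty score per image
keep = labels == 1                        # images passed on to the next step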

3.3 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient. On those images, pairwise similarity measures are computed based on the difference in color coherence vectors of the images, according to [17]. The difference in color coherence vectors of two images consists of the difference in number of coherent pixels and number of incoherent pixels of each color. The threshold value that determines whether a contiguous area is coherent or not is 2500 pixels, which corresponds to 1% of an image. The images are first low-pass filtered using a local averaging filter of size 5 × 5 pixels. The images are then converted from RGB values to indexed values with 128 different colors, using the colormap jet.
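A minimal sketch of the color coherence vector computation is given below. It follows the steps described above (low-pass filtering, color quantization, connected-component analysis with a coherence threshold), but the quantization is a plain uniform RGB quantization rather than MATLAB's rgb2ind with a jet colormap, and the way the pairwise difference is turned into a similarity percentage in the thesis is not specified, so only the raw distance is shown.

```python
import numpy as np
from scipy.ndimage import uniform_filter, label

def color_coherence_vector(image_rgb, n_levels=4, tau=2500):
    # Low-pass filter each channel with a 5x5 local averaging filter.
    smoothed = uniform_filter(image_rgb.astype(float), size=(5, 5, 1))
    # Uniform quantization to n_levels^3 colors, assuming 8-bit RGB input
    # (stand-in for rgb2ind with a 128-color jet colormap).
    quantized = np.floor(smoothed / 256.0 * n_levels).clip(0, n_levels - 1)
    index = (quantized[..., 0] * n_levels**2 +
             quantized[..., 1] * n_levels +
             quantized[..., 2]).astype(int)
    n_colors = n_levels**3
    alpha = np.zeros(n_colors)   # coherent pixels per color
    beta = np.zeros(n_colors)    # incoherent pixels per color
    for c in range(n_colors):
        components, n_comp = label(index == c)
        for comp_id in range(1, n_comp + 1):
            size = int((components == comp_id).sum())
            if size >= tau:
                alpha[c] += size
            else:
                beta[c] += size
    return alpha, beta

def ccv_distance(ccv_a, ccv_b):
    # L1 difference in coherent and incoherent pixel counts, as in [17];
    # the thesis converts this into a similarity percentage (normalization assumed).
    return np.abs(ccv_a[0] - ccv_b[0]).sum() + np.abs(ccv_a[1] - ccv_b[1]).sum()
```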

The images are then clustered based on the similarity measures. The pairwise similarity measures from all images in a set form a similarity matrix, which is then clustered. The clustering is done by placing an image in a cluster if it has an average similarity above 87% to that cluster. The average similarity between an image and a cluster is the mean value of the pairwise similarity measures between the image and all images in the cluster. From each cluster only one image is retrieved, namely the one with the highest sum of the score for being classified in the good quality class and the score for being classified in the salient class. The result is a set of images which are all unique compared to each other.
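The clustering and the per-cluster selection could then be sketched as below, assuming the pairwise similarities have been normalized to the range 0-1 (so that the 87% threshold becomes 0.87) and that each image has the combined classification score described above.

```python
import numpy as np

def cluster_by_average_similarity(similarity, threshold=0.87):
    # similarity: symmetric N x N matrix of pairwise similarities in [0, 1].
    clusters = []                       # each cluster is a list of image indices
    for i in range(similarity.shape[0]):
        placed = False
        for cluster in clusters:
            if np.mean([similarity[i, j] for j in cluster]) > threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

def retrieve_one_per_cluster(clusters, combined_scores):
    # Keep the image with the highest quality + saliency score in each cluster.
    return [max(cluster, key=lambda i: combined_scores[i]) for cluster in clusters]
```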


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval are evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false negative/positive.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The results of the entire system are then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well this conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, the same way as for the retrieval part. That way, true/false negatives/positives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives),    (3.1)

which describes how many of the retrieved images should have been retrieved,

Recall = true positives / (true positives + false negatives),    (3.2)

which describes how many of the images that should be retrieved actually are retrieved, and

Accuracy = (true positives + true negatives) / all samples,    (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false negatives/positives and the measures are illustrated in figure 3.2.
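Expressed in code, the three measures follow directly from the boolean decisions and the ground truth; a small sketch:

```python
import numpy as np

def precision_recall_accuracy(predicted, wanted):
    # predicted, wanted: boolean arrays, True = retrieved / should be retrieved.
    predicted, wanted = np.asarray(predicted), np.asarray(wanted)
    tp = np.sum(predicted & wanted)
    fp = np.sum(predicted & ~wanted)
    fn = np.sum(~predicted & wanted)
    tn = np.sum(~predicted & ~wanted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy
```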

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are noted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b) precision, (c) recall and (d) accuracy.

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly, and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized, and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65% of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
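A sketch of the degradation and labelling step is shown below, using scikit-image. The exact alteration strengths used in the thesis are not specified, so the parameter ranges are placeholders, motion blur is omitted for brevity, and SSIM is computed on grayscale versions of the images.

```python
import numpy as np
from skimage import color, util, filters
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)

def degrade(image):
    # image: float RGB image in [0, 1]. Apply at most one "light" alteration
    # and at most one "noise/blur" alteration, with randomized strength.
    out = image.copy()
    light = rng.choice(["brighten", "darken", "none"])
    if light == "brighten":
        out = np.clip(out + rng.uniform(0.1, 0.4), 0, 1)
    elif light == "darken":
        out = np.clip(out - rng.uniform(0.1, 0.4), 0, 1)
    noise = rng.choice(["salt_pepper", "gaussian_noise", "gaussian_blur", "none"])
    if noise == "salt_pepper":
        out = util.random_noise(out, mode="s&p", amount=rng.uniform(0.01, 0.1))
    elif noise == "gaussian_noise":
        out = util.random_noise(out, mode="gaussian", var=rng.uniform(0.005, 0.05))
    elif noise == "gaussian_blur":
        out = filters.gaussian(out, sigma=rng.uniform(1, 5), channel_axis=-1)
    return out

def quality_label(original, degraded, threshold=0.65):
    # "good" if more than 65% of the structures are preserved.
    score = structural_similarity(color.rgb2gray(original),
                                  color.rgb2gray(degraded), data_range=1.0)
    return "good" if score > threshold else "bad"
```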

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions in (b) brightened and Gaussian blurred, (c) motion blurred, and (d) darkened with added salt and pepper noise. The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of it, and the quality evaluation set consists of the content evaluation set and modified versions of it. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.
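The near-duplicate versions could, for instance, be generated as below; the particular transformations and parameter values used in the thesis are not specified, so these are examples only.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def make_similar_versions(image):
    # Rigid transformations that keep the salient object in view.
    return [
        np.fliplr(image),                                    # mirror
        rotate(image, angle=5, axes=(0, 1), reshape=False),  # small rotation
        shift(image, shift=(10, -10, 0), mode="nearest"),    # small translation
    ]
```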

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which in turn equals the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike


Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.8399     0.939   0.8332    cat
HOG                              0.8544     0.9799  0.8636    airplane
HOG                              0.8018     0.9702  0.813     umbrella
HOG                              0.8333     0.9442  0.8332    handbag
HOG                              0.8506     0.9236  0.8353    motorbike
HOG                              0.8360     0.9514  0.8357    average
Extracted from the DCT domain    0.9196     0.9116  0.8832    cat
Extracted from the DCT domain    0.9292     0.9500  0.9109    airplane
Extracted from the DCT domain    0.9348     0.9444  0.9158    umbrella
Extracted from the DCT domain    0.9348     0.9251  0.9049    handbag
Extracted from the DCT domain    0.9308     0.9425  0.9120    motorbike
Extracted from the DCT domain    0.9298     0.9347  0.9054    average
Features extracted from a CNN    0.6951     1       0.6951    cat
Features extracted from a CNN    0.7288     1       0.7288    airplane
Features extracted from a CNN    0.6935     1       0.6935    umbrella
Features extracted from a CNN    0.6821     1       0.6821    handbag
Features extracted from a CNN    0.6902     1       0.6902    motorbike
Features extracted from a CNN    0.6979     1       0.6979    average

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) from using features extracted from the DCT domain and (c) from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object a set of 920 images without modifications is used for evaluation; 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. Placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision  Recall  Accuracy  Salient object
HOG                              0.6631     0.6717  0.6652    cat
HOG                              0.8645     0.8043  0.8391    airplane
HOG                              0.5959     0.5739  0.5924    umbrella
HOG                              0.6759     0.6348  0.6652    handbag
HOG                              0.5758     0.7348  0.5967    motorbike
HOG                              0.6750     0.6839  0.6717    average
Extracted from the DCT domain    0.6253     0.6239  0.6250    cat
Extracted from the DCT domain    0.8182     0.6457  0.7511    airplane
Extracted from the DCT domain    0.6223     0.6196  0.6217    umbrella
Extracted from the DCT domain    0.6256     0.5630  0.613     handbag
Extracted from the DCT domain    0.5881     0.7326  0.6098    motorbike
Extracted from the DCT domain    0.6559     0.6370  0.6441    average
Features extracted from a CNN    0.9038     0.7761  0.8467    cat
Features extracted from a CNN    1          0.6935  0.8467    airplane
Features extracted from a CNN    0.8155     0.8457  0.8272    umbrella
Features extracted from a CNN    0.7560     0.6804  0.7304    handbag
Features extracted from a CNN    0.9242     0.8217  0.8772    motorbike
Features extracted from a CNN    0.8799     0.7635  0.8256    average

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) from using features extracted from the DCT domain and (c) from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown, in figure 4.5b, is a binary similarity matrix showing the true clusters as yellow. The results from the retrieval part are shown in table 4.4.

Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).

Figure 4.4: Color coherence vectors of the images in figure 4.3, where (a)-(c) correspond to images 4.3a-4.3c. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, the number of coherent pixels for each color, and the black bars represent β, the number of incoherent pixels for each color.

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.7782     0.9421  0.7806    cat
0.8071     0.8471  0.7611    airplane
0.7698     0.8843  0.7444    umbrella
0.7537     0.8471  0.7111    handbag
0.7935     0.9050  0.7778    motorbike
0.7805     0.8851  0.7550    average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since only half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision  Recall  Accuracy  Salient object
0.5944     0.6813  0.8543    cat
0.6890     0.5117  0.8663    airplane
0.5055     0.6696  0.8168    umbrella
0.4717     0.5117  0.8027    handbag
0.6169     0.6404  0.8592    motorbike
0.5755     0.6029  0.8399    average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to provide separating features. For HOG the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results appear in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should, however, be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c, as the different colored curves are similar; the difference in proportion of good images between the different salient objects, however, causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are, however, more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
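In practice this trade-off can be read out by sweeping a threshold over the classifier scores. A small sketch using scikit-learn is shown below; the ground-truth labels and decision scores are random placeholders, and the 95% recall target is only an example.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Placeholder ground truth and SVM decision scores for the evaluation set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)              # 1 = good, 0 = bad
scores = y_true + rng.normal(scale=0.8, size=500)  # higher score = more likely good

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Pick the strictest threshold that still keeps at least 95% of the good images.
candidates = thresholds[tpr >= 0.95]
chosen = candidates[0] if candidates.size else thresholds[-1]
```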


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line, in this case where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane, however, is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. The features extracted from a CNN come from a network trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks with good results. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist of mostly context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous context, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, together with the proportion of wanted images being only 0.1859 and the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of those for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that in the similarity retrieval part there is one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images retrieved are counted as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is not often the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. This suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between sets with different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically recorded flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7 and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Motivation
    • 12 Aim
    • 13 Limitations
      • 2 Related theory
        • 21 Available data
        • 22 Machine learning
        • 23 Support Vector Machines
        • 24 Histogram of oriented gradients
        • 25 Features extracted from the discrete cosine transform domain
        • 26 Features extracted from a convolutional neural network
          • 261 Convolutional neural networks
          • 262 Extracting features from a pre-trained network
            • 27 Color coherence vector
              • 3 Method
                • 31 Feature extraction
                • 32 Predictor
                • 33 Similarity retrieval
                • 34 Evaluation
                • 35 Generation of training and evaluation data
                  • 4 Results
                    • 41 Quality classification
                    • 42 Content classification
                    • 43 Similarity retrieval
                    • 44 The entire system
                      • 5 Discussion
                        • 51 Results
                          • 511 Quality classification
                          • 512 Content classification
                          • 513 Similarity retrieval part
                          • 514 The entire system
                            • 52 Method
                            • 53 Possible improvements
                              • 6 Conclusions
                              • Bibliography
Page 25: Feature extraction for image selection using machine learning

18 3 Method

Trainingset quality

Trainingset

content

FeatureExtraction

FeatureExtraction

Predictor

Predictor

QualityClassification

Model

FeatureExtraction

Evaluation set

bad

ContentClassification

Modelnon-salient

Similarityretrieval

Images Worthy ofFurther Analysis

Training

Evaluation

FeatureExtraction

good

salient

Figure 31 Flow chart of implementation The system is trained on two differentinput sets which leads to two classification models one for quality and one forcontent The evaluation set is classified using the two models the images that areclassified as both good and salient will be sent to the retrieval part In the retrievalpart a selection will be made from sets of images that are similar so that only onewill be retrieved The resulting images are good salient and unique which meansthat they are worthy of further analysis

31 Feature extraction

Three different methods of feature extraction are performed which leads to three differentresults for each classification which are compared against each other The best featureextraction method for each of the two classifications is used for that part and the entiresystem is put togetherThe methods that are used are the following histogram of orientedgradients (HOG) [20] features extracted from the discrete cosine (DCT) domain [21] andfeatures extracted from a pretrained convolutional neural network (CNN) [3] The featureextraction methods have different advantages which are the reasons for why they are cho-sen HOG is often used for object detection it uses gradients to describe images Sincegradients provide information about edges and corners in an image HOG is favorablewhen describing content in an image The method of extracting features from the DCTdomain on the other hand is chosen because the features are produced to describe quality

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother


3.4 Evaluation

The system is evaluated using the results from the evaluation part and how well they conform with the ground truth for the evaluation set. Each of the classifications and the retrieval is evaluated separately. For binary classification, the resulting output for every image is either the positive or the negative class, which is either true or false. This means each image can be described as a true/false positive/negative.

For the retrieval part, the resulting output for each image is whether it should be retrieved or not, which is either true or false. This means that every image can be described as a true/false positive/negative.

After evaluating each part separately, the system is put together. For each of the classifications, the feature extraction method which provided the best resulting average accuracy is used. The result of the entire system is then evaluated. That is done by describing which images are retrieved as worthy of further analysis and how well that conforms with which images should be. Images that are worthy of further analysis are images that are good, salient and unique with respect to the other retrieved images. The final output for an image is whether its retrieval is true or false, in the same way as for the retrieval part. That way, true/false positives/negatives are obtained.

All results will be evaluated using the measures precision, recall and accuracy, which are defined as

Precision = true positives / (true positives + false positives),    (3.1)

which describes how many of the retrieved images should have been retrieved,

Recall = true positives / (true positives + false negatives),    (3.2)

which describes how many of the images that should be retrieved are actually retrieved, and

Accuracy = (true positives + true negatives) / (all samples),    (3.3)

which describes how many classifications are correct out of all classifications made. The concept of true/false positives/negatives and the measures are illustrated in figure 3.2.


(a) Parts of a quantity of images (b) Precision (c) Recall (d) Accuracy

Figure 3.2: An illustration of the concepts used in the definition of the measures precision, recall and accuracy. Out of a quantity of images some are selected, which are denoted positives and can be either true or false. The non-selected images are called negatives, which can be either true or false. The different concepts are illustrated in (a) and how they define the measures is illustrated in (b), (c) and (d).
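As a small worked sketch of equations 3.1-3.3, assuming logical vectors predicted and actual that mark which images are selected and which should be:

    tp = sum( predicted &  actual);          % true positives
    fp = sum( predicted & ~actual);          % false positives
    fn = sum(~predicted &  actual);          % false negatives
    tn = sum(~predicted & ~actual);          % true negatives
    precision = tp / (tp + fp);
    recall    = tp / (tp + fn);
    accuracy  = (tp + tn) / (tp + fp + fn + tn);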

3.5 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories; to fit the task, new categories are formed. One category is set to form the salient class, and the investigation is performed multiple times with different objects as salient. The non-salient class contains images which are randomly selected from other categories than the one chosen as salient. The images have been manually weeded by removing non-representative images such as animated images, collages and images of questionable quality. After the weeding it is assumed that the images are of good quality to begin with, and they are placed in the good class. The data is modified to fit the task by modifying quality parameters to degrade the image quality in the following ways: brightening, darkening, adding salt and pepper noise, adding Gaussian noise, adding Gaussian blur and adding motion blur. To avoid the alterations counteracting each other, they are divided into the two groups light and noise/blur. The modification is done randomly and one image can be subject to one alteration alone or a combination of two alterations. To one image, at most one alteration from each group is applied. The degree of the degradation is randomized and the degraded image is then compared to the original using the structural similarity (SSIM) index introduced in [21]. SSIM provides an objective measurement of the quality of an image compared to a reference image. The measurement focuses on comparing how well the structures in the image are preserved, and considers image degradations as perceived changes in structural information. The images that have an SSIM value above 0.65 have more than 65 % of their structures preserved and are set to belong to the good class. The images that have an SSIM value of 0.65 or less are assumed to be of bad quality and make up the bad class. Examples of images which have been degraded to SSIM = 0.65 are shown in figure 3.3.
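A hedged sketch of the degradation and labelling of one image is given below; the parameter ranges and the grayscale conversion before the SSIM computation are assumptions made here, only the grouping of alterations and the 0.65 threshold follow the text.

    I = imread('example.jpg');                       % unmodified image, assumed good
    J = I;
    if rand < 0.5                                    % light group: darken or brighten
        factors = [0.4 + 0.4*rand, 1.2 + 0.8*rand];
        J = J * factors(randi(2));                   % uint8 arithmetic saturates
    end
    if rand < 0.5                                    % noise/blur group: one alteration
        switch randi(4)
            case 1, J = imnoise(J, 'salt & pepper', 0.1*rand);
            case 2, J = imnoise(J, 'gaussian', 0, 0.02*rand);
            case 3, J = imgaussfilt(J, 0.5 + 3*rand);
            case 4, J = imfilter(J, fspecial('motion', 5 + randi(20), 360*rand));
        end
    end
    if ssim(rgb2gray(J), rgb2gray(I)) > 0.65         % compare against the original
        qualityLabel = 'good';
    else
        qualityLabel = 'bad';
    end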


(a) Original image (b) Brightened and Gaussian blurred (c) Motion blurred (d) Darkened and added salt and pepper noise

Figure 3.3: An image and examples of degraded versions of it; the original is seen in (a) and the degraded versions are seen in (b), (c) and (d). The degraded images have been subjected to different degradation methods and have the same SSIM index ≈ 0.65.

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80 % training data and 20 % evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of them, and the quality evaluation set consists of the content evaluation set and modified versions of them. The good class consists of all images in the salient and the non-salient class and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is also modified to fit the task by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. That is done without changing the saliency of the images, meaning that the salient object is present in all versions of the images. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
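A minimal sketch of how such near-duplicate versions can be generated is given below; the particular angle, crop window and mirroring are illustrative assumptions and not taken from the original text.

    I = imresize(imread('example.jpg'), [500 500]);            % original 500 x 500 image
    versions = {I};
    versions{end + 1} = flip(I, 2);                            % horizontal mirroring
    versions{end + 1} = imrotate(I, 5, 'bilinear', 'crop');    % small rotation
    versions{end + 1} = imresize(imcrop(I, [26 26 449 449]), [500 500]);  % shifted crop
    % all elements of 'versions' are treated as members of the same cluster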

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100 % and a precision value equal to the classification accuracy, which in turn equals the proportion of good images. If the difference in number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images    Salient object
0.6951                    cat
0.7288                    airplane
0.6935                    umbrella
0.6821                    handbag
0.6902                    motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.8399      0.939    0.8332     cat
HOG                              0.8544      0.9799   0.8636     airplane
HOG                              0.8018      0.9702   0.813      umbrella
HOG                              0.8333      0.9442   0.8332     handbag
HOG                              0.8506      0.9236   0.8353     motorbike
HOG                              0.8360      0.9514   0.8357     average
Extracted from the DCT domain    0.9196      0.9116   0.8832     cat
Extracted from the DCT domain    0.9292      0.9500   0.9109     airplane
Extracted from the DCT domain    0.9348      0.9444   0.9158     umbrella
Extracted from the DCT domain    0.9348      0.9251   0.9049     handbag
Extracted from the DCT domain    0.9308      0.9425   0.9120     motorbike
Extracted from the DCT domain    0.9298      0.9347   0.9054     average
Features extracted from a CNN    0.6951      1        0.6951     cat
Features extracted from a CNN    0.7288      1        0.7288     airplane
Features extracted from a CNN    0.6935      1        0.6935     umbrella
Features extracted from a CNN    0.6821      1        0.6821     handbag
Features extracted from a CNN    0.6902      1        0.6902     motorbike
Features extracted from a CNN    0.6979      1        0.6979     average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation. 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The numbers of images in the two categories are equal, which makes the values for precision, recall and accuracy easy to interpret. Placing all images in one class would lead to an accuracy of 50 % and one of the values for precision or recall to 100 % and the other to 50 %, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.6631      0.6717   0.6652     cat
HOG                              0.8645      0.8043   0.8391     airplane
HOG                              0.5959      0.5739   0.5924     umbrella
HOG                              0.6759      0.6348   0.6652     handbag
HOG                              0.5758      0.7348   0.5967     motorbike
HOG                              0.6750      0.6839   0.6717     average
Extracted from the DCT domain    0.6253      0.6239   0.6250     cat
Extracted from the DCT domain    0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain    0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain    0.6256      0.5630   0.613      handbag
Extracted from the DCT domain    0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain    0.6559      0.6370   0.6441     average
Features extracted from a CNN    0.9038      0.7761   0.8467     cat
Features extracted from a CNN    1           0.6935   0.8467     airplane
Features extracted from a CNN    0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN    0.7560      0.6804   0.7304     handbag
Features extracted from a CNN    0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN    0.8799      0.7635   0.8256     average


(a) HOG features (b) Features extracted from the DCT domain

(c) Features extracted from a CNN

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat are shown in figure 4.3, and their color coherence vectors in figure 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix, showing the true clusters as yellow, in 4.5b. The results from the retrieval part are shown in table 4.4.


(a) (b) (c)

Figure 4.3: Examples of images that are clustered as similar and an image that is not. Images (a) and (b) are placed in the same similarity cluster with similarity 91.18 %. Image (c) is not placed in the same cluster and has resulting similarities 32.46 % to (a) and 32.06 % to (b).


(a) Color coherence vector of image 4.3a (b) Color coherence vector of image 4.3b (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.


(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the truly similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.7782      0.9421   0.7806     cat
0.8071      0.8471   0.7611     airplane
0.7698      0.8843   0.7444     umbrella
0.7537      0.8471   0.7111     handbag
0.7935      0.9050   0.7778     motorbike
0.7805      0.8851   0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain. That is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classification are the ones retrieved using features extracted from a CNN. That is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.5944      0.6813   0.8543     cat
0.6890      0.5117   0.8663     airplane
0.5055      0.6696   0.8168     umbrella
0.4717      0.5117   0.8027     handbag
0.6169      0.6404   0.8592     motorbike
0.5755      0.6029   0.8399     average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54 %, compared to 83.57 % for HOG and 69.79 % for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained to be able to provide separating features. For HOG, the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which essentially corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56 %, compared to 67.17 % for HOG and 64.41 % for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects are differing. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. The task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50 %, with the best result being 78.06 % and the worst 71.11 %. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves. That is because the majority of the images consist mostly of context, and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87 % is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99 %, with the best result being 86.63 % and the worst 80.27 %. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and for all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that in the similarity retrieval part there is one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are correct. In many cases an image retrieved from a cluster has been modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of for example SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (lenet) URL httpdeeplearningnettutoriallenethtml Cited on page 15

[2] BH Boyle Support Vector Machines Data Analysis Machine Learning and Ap-plications Computer science technology and applications Nova Science Publish-ers 2011 ISBN 9781612093420 URL httpsbooksgooglecoukbooksid=T7tAYgEACAAJ Cited on page 7

[3] K Chatfield, K Simonyan, A Vedaldi, and A Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 105591978-1-57735-516-8IJCAI11-210. URL httpdxdoiorg105591978-1-57735-516-8IJCAI11-210. Cited on page 13.

[5] RL Delanoy Machine learning apparatus and method for image searching Au-gust 11 1998 URL httpswwwgooglecompatentsUS5793888US Patent 5793888 Cited on page 1

[6] Jeff Donahue Yangqing Jia Oriol Vinyals Judy Hoffman Ning Zhang Eric Tzengand Trevor Darrell Decaf A deep convolutional activation feature for generic visualrecognition CoRR abs13101531 2013 URL httparxivorgabs13101531 Cited on page 15

[7] Eren Golge How does feature extraction work on images URL httpswwwquoracomprofileEren-GolgeMachine-LearningHow-does-feature-extraction-work-on-images Cited on page 5

[8] L Greche and N Es-Sbai Automatic system for facial expression recognitionbased histogram of oriented gradient and normalized cross correlation In 2016 In-ternational Conference on Information Technology for Organizations Development



(IT4OD) pages 1ndash5 March 2016 doi 101109IT4OD20167479316 Cited onpage 9

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL httpdblpuni-trierdedbconfiscasiscas2010htmlLeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J Belongie, Lubomir D Bourdev, Ross B Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL httparxivorgabs14050312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL httpssemathworkscomhelpstatssupport-vector-machines-for-binary-classificationhtml. Cited on pages 6, 7 and 19.

[12] MathWorks Extracthogfeatures URL httpssemathworkscomhelpvisionrefextracthogfeatureshtml Cited on page 9

[13] MathWorks Discrete cosine transform URL httpssemathworkscomhelpimagesdiscrete-cosine-transformhtml Cited onpage 10

[14] MathWorks Supervised learning workflow and algorithms URL httpssemathworkscomhelpstatssupervised-learning-machine-learning-workflow-and-algorithmshtmls_tid=conf_addres_DA_eb Cited on page 5

[15] Michael A Nielsen Neural Networks and Deep Learning Determination Press2015 Cited on page 14

[16] Parul Parashar and Er Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 101145244130244148. URL httpdoiacmorg101145244130244148. Cited on pages 16 and 19.

[18] Srini Penchikala Big data processing with apache spark - part 4 Spark ma-chine learning May 2016 URL httpswwwinfoqcomarticlesapache-spark-machine-learning Cited on page 4

[19] MA Saad, AC Bovik, and C Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11 and 19.


[20] F Suard, A Rakotomamonjy, and A Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18 and 19.

[21] Zhou Wang, A C Bovik, H R Sheikh, and E P Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 101109TIP2003819861. URL httpdxdoiorg101109TIP2003819861. Cited on pages 18 and 22.

  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Motivation
    • 12 Aim
    • 13 Limitations
      • 2 Related theory
        • 21 Available data
        • 22 Machine learning
        • 23 Support Vector Machines
        • 24 Histogram of oriented gradients
        • 25 Features extracted from the discrete cosine transform domain
        • 26 Features extracted from a convolutional neural network
          • 261 Convolutional neural networks
          • 262 Extracting features from a pre-trained network
            • 27 Color coherence vector
              • 3 Method
                • 31 Feature extraction
                • 32 Predictor
                • 33 Similarity retrieval
                • 34 Evaluation
                • 35 Generation of training and evaluation data
                  • 4 Results
                    • 41 Quality classification
                    • 42 Content classification
                    • 43 Similarity retrieval
                    • 44 The entire system
                      • 5 Discussion
                        • 51 Results
                          • 511 Quality classification
                          • 512 Content classification
                          • 513 Similarity retrieval part
                          • 514 The entire system
                            • 52 Method
                            • 53 Possible improvements
                              • 6 Conclusions
                              • Bibliography
Page 26: Feature extraction for image selection using machine learning

32 Predictor 19

parameters in an image The last method using features extracted from a CNN wherethe network is trained on a large set of images in an object recognition task to be able togeneralize to other tasks and data sets for which the network has not been trained Themethod is chosen because of its ability to perform well on generic tasks

32 Predictor

The predictor used is an SVM as described in section 2 using the MATLAB implementa-tion [11] The model is trained on labelled examples of images of good and bad qualityto retrieve a quality classification model Another SVM model is trained on labelled ex-amples of salient and non-salient images to retrieve a content classification model Whenusing a model to classify new data the resulting output for each image is a class label anda certainty score matrix The score matrix contains the scores for each image being classi-fied in the negative class and the positive class respectively The predictor SVM is chosenbecause of its advantages one of them being not having the problem of over-fitting Over-fitting occurs when a model has too many features relative to the number of observationsand results in poor predictive performance The problem of over-fitting is relevant to takeinto account when working with machine learning on images because the number of fea-tures extracted from an image is often very large [16] SVM has previously been used inmany image classification tasks with good results [20] [19]

33 Similarity retrieval

The retrieval step is performed on images that are classified as both good and salient Onthose images pairwise similarity measures is done based on difference in color coherencevectors of the images according to [17] The difference in color coherence vectors of twoimages consists of difference in number of coherent pixels and number of incoherentpixels of each color The threshold value that determines whether a contiguous area iscoherent or not is 2500 pixels which correstponds to 10 of an image The images arefirst low-pass filtered using a local averaging filter of size 5 times 5 pixels The images arethen converted from RGB valued to indexed valued with 128 different colors using thecolormap jet

The images are then clustered based on the similarity measures The pairwise similar-ity measures from all images in a set form a similarity matrix which is then clustered Theclustering is done by placing an image in a cluster if it has an average similarity above87 to that cluster The average similarity between an image and a cluster is the meanvalue of the pairwise similarity measures between an image and all images in the clusterFrom each cluster only one image is retrieved and that is the one with the highest sum ofthe score for being classified in the good quality class and the score for being classifiedin the salient class The result is a set of images which are all unique compared to eachother

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average


Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in Figure 4.3 and Figure 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in Figure 4.5a. Also shown, in Figure 4.5b, is a binary similarity matrix showing the true clusters as yellow. The results from the retrieval part are shown in Table 4.4.
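The clustering itself is described in section 3.3; the following is only an illustrative sketch, not the implementation used in this work. It assumes a pairwise similarity function and per-image scores are given, groups images greedily by thresholding the similarity, and keeps the highest-scoring image from each group:

    def cluster_and_retrieve(images, similarity, scores, threshold=0.87):
        # similarity(a, b): assumed pairwise similarity on a 0-1 scale, so 0.87
        # corresponds to the 87% threshold used in this work.
        # Greedy grouping against the first member of each cluster is a
        # simplification, not necessarily the rule used in section 3.3.
        clusters = []
        for idx in range(len(images)):
            placed = False
            for cluster in clusters:
                if similarity(images[cluster[0]], images[idx]) > threshold:
                    cluster.append(idx)
                    placed = True
                    break
            if not placed:
                clusters.append([idx])
        # One image per cluster: the one with the highest classification score.
        return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]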


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


Figure 4.4: Color coherence vectors of the images in Figure 4.3: (a) for image 4.3a, (b) for image 4.3b and (c) for image 4.3c. The x-axis is the indexed colors and the y-axis is the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.
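The α and β counts above can be sketched as follows. This is only an illustration of the idea: the color quantization is assumed to be done beforehand, and the connectivity and the region-size threshold are assumptions rather than the parameter choices of this work:

    import numpy as np
    from scipy import ndimage

    def color_coherence_vector(quantized, n_colors, tau=50):
        # quantized: 2-D array of color indices; tau: region-size threshold
        # separating coherent from incoherent pixels (50 is an assumed value).
        alpha = np.zeros(n_colors, dtype=int)
        beta = np.zeros(n_colors, dtype=int)
        for color in range(n_colors):
            labeled, _ = ndimage.label(quantized == color)   # 4-connected regions
            sizes = np.bincount(labeled.ravel())[1:]         # region sizes, background skipped
            alpha[color] = int(sizes[sizes > tau].sum())     # coherent pixels
            beta[color] = int(sizes[sizes <= tau].sum())     # incoherent pixels
        return alpha, beta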


Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.
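Computing only the upper triangle, as described in the caption, can be sketched as follows (illustrative only; the similarity function itself is assumed to be given):

    import numpy as np

    def pairwise_similarity_matrix(images, similarity):
        # Only the upper triangle is filled, so each pair is compared exactly once.
        n = len(images)
        sim = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                sim[i, j] = similarity(images[i], images[j])
        return sim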


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall    Accuracy   Salient object
0.7782      0.9421    0.7806     cat
0.8071      0.8471    0.7611     airplane
0.7698      0.8843    0.7444     umbrella
0.7537      0.8471    0.7111     handbag
0.7935      0.9050    0.7778     motorbike
0.7805      0.8851    0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN, the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in Table 4.5.
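As a rough sketch of how the three parts are chained, the selection can be expressed as below. The function names are placeholders, and summing the two classification scores to rank images within a cluster is a simplifying assumption, not the exact implementation:

    def select_images(images, quality_score, content_score, cluster_by_similarity):
        # quality_score / content_score: assumed SVM decision functions, > 0 meaning
        # the positive class (good respectively salient); cluster_by_similarity:
        # assumed to group a list of images into clusters of indices.
        good = [i for i, image in enumerate(images) if quality_score(image) > 0]
        salient = [i for i in good if content_score(images[i]) > 0]
        clusters = cluster_by_similarity([images[i] for i in salient])
        selected = []
        for cluster in clusters:  # each cluster holds positions into `salient`
            # Keep the image with the highest combined classification score
            # (the combination rule is an assumption for illustration).
            best = max(cluster, key=lambda k: quality_score(images[salient[k]])
                                              + content_score(images[salient[k]]))
            selected.append(salient[best])
        return selected  # indices of images suggested for further analysis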

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall    Accuracy   Salient object
0.5944      0.6813    0.8543     cat
0.6890      0.5117    0.8663     airplane
0.5055      0.6696    0.8168     umbrella
0.4717      0.5117    0.8027     handbag
0.6169      0.6404    0.8592     motorbike
0.5755      0.6029    0.8399     average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC-curve in Figure 4.1c, where all curves are very close to where the true positive rate equals the false positive rate, which is retrieved when placing all images in one class when the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features. For HOG the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe difference in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications



in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients: noise should lead to many small, intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in Figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that that method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in Figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in Figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in Figure 4.1.
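Tracing such a curve from the classifier scores can be sketched as follows (illustrative only, not the evaluation code used in this work):

    import numpy as np

    def roc_points(scores, labels):
        # Sweep the decision threshold over the classifier scores and return
        # (false positive rate, true positive rate) pairs; labels: 1 = good, 0 = bad.
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=int)
        points = []
        for threshold in np.sort(scores)[::-1]:
            predicted = scores >= threshold
            tpr = np.sum(predicted & (labels == 1)) / np.sum(labels == 1)
            fpr = np.sum(predicted & (labels == 0)) / np.sum(labels == 0)
            points.append((fpr, tpr))
        return points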


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in Figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since the method is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks, often performing well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in Figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors


are calculated over the entire images. Applying a transformation to an image with a homogeneous background, while keeping the salient object present, does not cause as big a change in the color coherence vector as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in Figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
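Choosing the threshold by testing can be sketched as a simple sweep; this is only an illustration, where the accuracy function for the retrieval output is assumed to be given:

    import numpy as np

    def choose_similarity_threshold(candidates, evaluation_sets, retrieval_accuracy):
        # retrieval_accuracy(evaluation_set, threshold) is assumed to run the
        # retrieval on one salient-object set and return its accuracy.
        best_threshold, best_mean = None, -1.0
        for threshold in candidates:
            mean_accuracy = np.mean([retrieval_accuracy(s, threshold) for s in evaluation_sets])
            if mean_accuracy > best_mean:
                best_threshold, best_mean = threshold, mean_accuracy
        return best_threshold, best_mean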

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, together with the proportion of wanted images being only 0.1859 and the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more error cause when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified images that are retrieved are counted as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark regarding the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes, in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context; a rough sketch of this idea is given below. The images might be processed by an automatic target recognition system during flights when collecting data, but such results are not available for this study. Taking advantage of the results from such a system, the position of objects can be


found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
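The weighting idea mentioned above could look roughly as follows. The per-color weights, the L1-style comparison of the coherence pairs (in the spirit of [17]) and the normalization are assumptions for illustration, not parts of this work:

    import numpy as np

    def weighted_ccv_similarity(ccv_a, ccv_b, color_weights):
        # ccv_a, ccv_b: (alpha, beta) per-color counts; color_weights: lower weight
        # for colors expected in the flight context, so they contribute less.
        alpha_a, beta_a = (np.asarray(v, dtype=float) for v in ccv_a)
        alpha_b, beta_b = (np.asarray(v, dtype=float) for v in ccv_b)
        w = np.asarray(color_weights, dtype=float)
        distance = np.sum(w * (np.abs(alpha_a - alpha_b) + np.abs(beta_a - beta_b)))
        total = np.sum(w * (alpha_a + alpha_b + beta_a + beta_b))
        # Similarity in [0, 1], where 1 means identical under the weighting.
        return 1.0 - distance / total if total > 0 else 1.0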

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 27: Feature extraction for image selection using machine learning

20 3 Method

34 Evaluation

The system is evaluated using the results from the evaluation part and how well it con-forms with the ground truth for the evaluation set Each of the classifications and theretrieval is evaluated separately For binary classification the resulting output for everyimage is either the positive or the negative class which is either true or false This meanseach image can be described as a truefalse positivenegative

For the retrieval part the resulting output for each image is whether it should beretrieved or not which is either true or false This means that every image can be describedas a truefalse negativepositive

After evaluating each part separately the system is put together For each of the classifi-cations the feature extraction method which provided the best resulting average accuracyis used The results of the entire system is then evaluated That is done by describingwhich images are retrieved as worthy of further analysis and how well it conforms withwhich images that should be Images that are worthy of further analysis are images thatare good salient and unique with respect to the other retrieved images The final outputfor an image is whether its retrieval is true or false the same way as for the retrieval partThat way truefalse negativespositives are achieved

All results will be evaluated using the measures precision recall and accuracy whichare defined as

Precision =true positives

true positives + false positives(31)

which describes how many of the retrieved images which should be retrieved

Recall =true positives

true positives + false negatives(32)

which describes how many of the images that should be retrieved that are retrieved

Accuracy =true positives + true negatives

all samples(33)

which describes how many classifications that are out of all classifications made Theconcept of truefalse negativespositives and the measures are illustrated in the in figure32

35 Generation of training and evaluation data 21

(a) Parts of a quantity of images

(b) Precision (c) Recall (d) Accuracy noise

Figure 32 An illustration of the concept used in the definition of the measuresprecision recall and accuracy Out of a quantity of images some are selected whichare noted positives and can be either true or false The non-selected images are callednegatives which can be either true or false The different concepts are illustrated in(a) and how they define the measures is illustrated in (b) (c) and (d)

35 Generation of training and evaluation data

The COCO data set consists of objects sorted into 91 different categories to fit the tasknew categories are formed One category is set to form the salient class the investiga-tion is performed multiple times with different objects as salient The non-salient classcontain images which are randomly selected from other categories than the one chosen assalient The images have been manually weeded by removing non-representative imagessuch as animated images collages and images of questionable quality After the weedingit is assumed that the images are of good quality to begin with and are placed in the goodclass The data is modified to fit the task by modifying quality parameters to degrade theimage quality in the following way brightening darkening adding salt and pepper-noise

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color

43 Similarity retrieval 33

(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices

34 4 Results

Table 44 Results from the evaluation of the retrieval part for different categories assalient

Precision Recall Accuracy Salient object07782 09421 07806 cat08071 08471 07611 airplane07698 08843 07444 umbrella07537 08471 07111 handbag07935 09050 07778 motorbike07805 08851 07550 average

44 The entire system

The entire system is put together using the quality classification models retrieved usingfeatures extracted from the DCT domain It is the feature extraction method which pro-vided the best results when investigating the quality classification in section 41 Themodels used for the content classifications are the ones retrieved using features extractedfrom a CNN It is the feature extraction method which provided the best results wheninvestigating the content classification in section 42 The evaluation of the entire systemis done for each of the salient objects The evaluation is performed on the same sets as theevaluation of the quality classification which contains the evaluation sets from the contentclassification and the similarity retrieval The output from the quality classification is in-put to the content classification and the output from the content classification is input tothe similarity retrieval part The results from the similarity retrieval part are the imagesthat are evaluated compared to the images which are wanted The images that are wantedare the ones which are actually good salient unique and best from its cluster There arefewer images that are wanted than images that are not since half of the images are salientand some of them are almost duplicates andor bad There are 342 wanted images out ofthe total 1840 images which makes the proportion of wanted images 01859 The resultsof how the entire system works together is seen in table 45

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision Recall Accuracy Salient object05944 06813 08543 cat06890 05117 08663 airplane05055 06696 08168 umbrella04717 05117 08027 handbag06169 06404 08592 motorbike05755 06029 08399 average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salientobjects it is at most 309 percentage units from the mean value The variation in accuracyvalues for the different sets of salient objects overall matches the variation in proportionin good images meaning that the salient objects with slightly higher proportion of goodimages also have slightly higher accuracy Therefore it is possible to interpret the resultsfrom the quality classification as being general and not varying remarkable with the dif-ferent salient objects This can be seen in the ROC-curves in figure 41b and 41c as thedifferent colored curves being similar the difference in proportion of good between thedifferent salient objects however causes slight variations In the ROC-curve for HOG fea-tures in figure 41a the curves are not very similar which is partly because the differentproportions of good images but mostly because it does not provide a good quality classi-fication model HOG provides a poor classification model from which the results variesbetween the different salient objects

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41

51 Results 37

512 Content classification

The evaluation of the content classification shows that features extracted from a CNN givesthe best results Features extracted from a CNN gives an average accuracy of 8256 com-pared to 6717 for HOG and 6441 for features extracted from the DCT domain Theaccuracy values have variances 3155 for features extracted from a CNN 10005 forHOG and 6571 for features extracted from the DCT domain Those numbers are allquite high and implies that the content classification is not general and varies significantlywith the different salient objects That can also be seen in the ROC-curves in figure 42as the different colored curves representing different salient objects are differing Figure42b which shows the results from using features extracted from the DCT domain showsthat the curves for the different salient objects are quite similar except for the categoryairplane All curves are rather close to the line where the true positive rate equals thefalse positive rate except for airplane Being close to that line for this case where each ofthe two classes contain half of the images corresponds to simply classifying all imagesin the same class That means that the category airplane is the only one for which a de-cent classification model is retrieved The bad performance of features extracted from theDCT domain for content classification for the majority of the different salient objects isnot astonishing since it uses very few features describing statistics in images associatedwith quality The decent result for the category airplane however is more astonishingsince it is able to differ somewhat between salient and non-salient images only describedby smoothness texture and edge information Features extracted from a CNN are trainedon a large set of images for an object classification task The task is similar to this con-tent classification and the features seem to fulfill their purpose of performing well whenapplied to new data sets HOG are often used for content classification tasks and perform-ing well However this shallow feature extraction method is outperformed by featuresextracted from a deep architecture

The number of salient and non-salient training images is approximately 2000 for eachsalient object but it varies slightly The largest salient class consists of 2418 images andthe smallest 1700 Although the number of training observations for each salient objectis quite large the variation may impact the capacity of the resulting content classificationmodels The variations in the content classification results is however more likely causedby the different content in the images

As described for the quality classification in section 511 if one type of error is pre-ferred over the other In this case where one class is retained and the other is not it mightbe more important not to discard too many salient images than to discard all non-salientimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 42

513 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 7550 with the best result being7806 and the worst 7111 The result varies with a few percentage points betweenthe different salient objects and the variance in accuracy is 813 That is most likelycaused by the context of the salient objects rather than the objects themselves That isbecause majority of the images consists of mostly context and the color coherence vectors

38 5 Discussion

are calculated over the entire images Applying a transformation to an image with ahomogeneous background still having the salient object present does not cause a changein the color coherence vector as big as it would be if the background were changing Thismight explain why the two sets with the lowest resulting accuracy have the salient objectshandbag and umbrella which are typically found in varying contexts such as crowds ofpeople The sets with the salient objects cat motorbike and airplane has the best resultingaccuracy Those salient objects are often found in relatively homogeneous context suchas indoor environment roads and sky

The similarity threshold was chosen by testing, because it gave the best resulting accuracy on average over the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 0.87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.
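Given such pairwise similarities, the clustering can be sketched as follows: every pair of images with similarity at or above the threshold is considered similar, and images connected through a chain of similar pairs are placed in the same cluster. This grouping rule is an assumption made for the sketch; the thesis only specifies that the threshold 0.87 separates similar from non-similar pairs.

import numpy as np

def cluster_by_threshold(similarity, threshold=0.87):
    # Assign a cluster id to each image by traversing the graph in which an
    # edge connects images whose pairwise similarity reaches the threshold.
    n = similarity.shape[0]
    cluster_id = [-1] * n
    current = 0
    for i in range(n):
        if cluster_id[i] != -1:
            continue
        cluster_id[i] = current
        queue = [i]
        while queue:
            j = queue.pop()
            for k in range(n):
                if cluster_id[k] == -1 and similarity[j, k] >= threshold:
                    cluster_id[k] = current
                    queue.append(k)
        current += 1
    return cluster_id

# Toy example: images 0 and 1 are near-duplicates, image 2 is unique.
sim = np.array([[1.00, 0.91, 0.32],
                [0.91, 1.00, 0.33],
                [0.32, 0.33, 1.00]])
print(cluster_by_threshold(sim))   # [0, 0, 1]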

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. Both classifications have overall high precision values, which means that they do not falsely classify many images as good or salient. That, together with the proportion of wanted images being only 0.1859 and the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall. This explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of them for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified to have an SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough with SSIM to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient as the one retrieved from each cluster would probably increase the precision and recall values.
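As an arithmetic illustration of this effect, with made-up counts rather than the measured ones: suppose one evaluation set of 1840 images gives TP = 200, FP = 140, FN = 140 and TN = 1360. The definitions in equations 3.1-3.3 then give precision and recall of 200/340, approximately 0.59, while the accuracy is (200 + 1360)/1840, approximately 0.85. The large true-negative count lifts the accuracy far above the precision and recall, which is the pattern seen in the results for the entire system.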


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the chosen methods concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is likely much harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known which quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context; a sketch of such a weighting is given below. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the positions of objects can be found in images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
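The following is a minimal sketch of such a weighting, reusing the color coherence vector comparison sketched in section 5.1.3 above. The weight values and the choice of which indexed colors to down-weight are purely illustrative assumptions.

import numpy as np

def weighted_ccv_similarity(ccv1, ccv2, weights):
    # Each indexed color contributes to the distance in proportion to its
    # weight, so colors known to dominate the expected flight context
    # (for example sky or water) can be given weights below 1.
    a1, b1 = ccv1
    a2, b2 = ccv2
    dist = np.sum(weights * (np.abs(a1 - a2) + np.abs(b1 - b2)))
    norm = np.sum(weights * (a1 + b1 + a2 + b2))
    return 1 - dist / norm

# Hypothetical weighting: down-weight three indexed colors assumed to
# correspond to sky and water in the planned flight context.
weights = np.ones(27)
weights[[8, 17, 26]] = 0.2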

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
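As an indication of what that could look like, the following is a sketch of fine-tuning a pre-trained network for the two-class saliency task in PyTorch. The library, the choice of VGG-16 and all hyperparameters are illustrative assumptions; the thesis only uses a pre-trained network as a fixed feature extractor and does not train any network itself.

import torch
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet and replace the classifier
# head with a two-class output (salient / non-salient).
model = models.vgg16(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False           # keep the convolutional filters fixed
model.classifier[6] = nn.Linear(4096, 2)  # new head, trained from scratch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)

def train_one_epoch(loader):
    # loader yields (image batch, label batch); labels are 0/1 for saliency.
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()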

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is quite likely that it varies between data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically recorded flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 29: Feature extraction for image selection using machine learning

22 3 Method

adding Gaussian noise adding Gaussian blur and adding motion blur To avoid the alter-ations counteracting each other they are divided into the two groups light and noiseblurThe modification is done randomly and one image can be subject to one alteration aloneor a combination of two alterations To one image at most one alteration from each groupis applied The degree of the degradation is randomized and the degraded image is thencompared to the original using the structural similarity (SSIM) index introduced in [21]SSIM provides an objective measurement of the quality of an image compared to a ref-erence image The measurement focuses on comparing how well the structures in theimage are preserved and considers image degradations as perceived changes in structuralinformation The images that have an SSIM value above 65 have more than 65 of theirstructures preserved and are set to belong to the good class The images that have SSIMvalue 65 or less are assumed to be of bad quality and make up the bad class Examplesof images which have been degraded to SSIM = 65 are shown in figure 33

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part. The images are divided into approximately 80% training data and 20% evaluation data. The number of training images in the salient class is approximately 2000, but varies slightly depending on which object is set to salient. The number of training images in the non-salient class is approximately the same as the number of training images in the corresponding salient class. The number of images in the evaluation data set from the two classes is 920 for all different salient objects. The number of images in the classes good and bad differs in both the training set and the evaluation set. The quality training set consists of the content training set and modified versions of them, and the quality evaluation set consists of the content evaluation set and modified versions of them. The good class consists of all images in the salient and the non-salient class, and the modified versions of them having an SSIM value above 0.65. The bad class consists of the modified versions of the images in the salient and non-salient class that have an SSIM value less than or equal to 0.65. Therefore, the number of bad images is always less than the number of good images. The modification is done randomly, which means that the number of bad images varies depending on what object is set to salient.

The data is modified to fit the task also by creating images that are very similar to each other. That is done by applying one or more rigid transformations to an image, thereby creating different versions of it. This is done without changing the saliency of the images, meaning that the salient object is present in all versions of the image. Images that originate from the same image are assumed to be similar and belong to the same cluster. Examples of images that are set to similar are shown in figure 3.4. All images have been resized and cropped to obtain the size 500 × 500 pixels.

Figure 3.4: Examples of similar images that originate from the same image and belong to the same cluster.
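The clusters of similar images can be generated along the lines of the sketch below, where one source image is turned into several rigidly transformed versions. The rotation and translation ranges are assumptions, since the exact transformations are not specified here.

```python
import numpy as np
from skimage.transform import rotate

def make_cluster(image, n_versions, rng):
    """Create a cluster of similar images by rigid transformations
    (small rotations and translations) of one source image."""
    cluster = [image]
    for _ in range(n_versions - 1):
        version = rotate(image, angle=rng.uniform(-15, 15), mode="edge")
        shift = rng.integers(-20, 21, size=2)  # translate by whole pixels
        version = np.roll(version, shift=tuple(shift), axis=(0, 1))
        cluster.append(version)
    return cluster
```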

4 Results

4.1 Quality classification

The evaluation of the quality classification is done for each of the salient objects. For each salient object, a set of 1840 images is used for evaluation. Each set consists of both salient and non-salient images; 920 images have been modified randomly as described in section 3.5 and 920 images have not. The images that have an SSIM value above 0.65 should be classified as good and the rest as bad. Since the degradation is done randomly, the number of good and bad images in the evaluation set varies with the salient objects. The number of images in the good class is always larger than the number of images in the bad class, and therefore classifying all images as good gives a recall value of 100% and a precision value equal to the classification accuracy, which in turn equals the proportion of good images. If the difference in the number of images in the two classes is large enough, classifying all images as good might lead to a false perception of good results. Therefore, the proportion of good images needs to be considered when interpreting the results. The proportion of good images for the different salient objects is shown in table 4.1. The results of the quality classification are shown in table 4.2. The results are visualized using receiver operating characteristic (ROC) curves, shown in figure 4.1. The ROC-curves show the relation between true positive rate (recall) and false positive rate.
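To make this baseline explicit, write N_good and N_bad for the two class sizes (notation introduced here for illustration, not taken from the report). A classifier that labels every image as good then gives

\[ TP = N_{good}, \quad FP = N_{bad}, \quad FN = TN = 0, \]
\[ \text{recall} = \frac{TP}{TP + FN} = 1, \qquad \text{precision} = \frac{TP}{TP + FP} = \frac{N_{good}}{N_{good} + N_{bad}} = \text{accuracy}, \]

so the proportions in table 4.1 are the reference level against which the accuracies in table 4.2 should be read.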

Table 4.1: The proportion of good images for the different salient objects.

Proportion good images   Salient object
0.6951                   cat
0.7288                   airplane
0.6935                   umbrella
0.6821                   handbag
0.6902                   motorbike



Table 4.2: Results from the evaluation of the quality classification for the different feature extraction methods and for different categories as salient.

Feature extraction method       Precision   Recall   Accuracy   Salient object
HOG                             0.8399      0.939    0.8332     cat
HOG                             0.8544      0.9799   0.8636     airplane
HOG                             0.8018      0.9702   0.813      umbrella
HOG                             0.8333      0.9442   0.8332     handbag
HOG                             0.8506      0.9236   0.8353     motorbike
HOG                             0.8360      0.9514   0.8357     average
Extracted from the DCT domain   0.9196      0.9116   0.8832     cat
Extracted from the DCT domain   0.9292      0.9500   0.9109     airplane
Extracted from the DCT domain   0.9348      0.9444   0.9158     umbrella
Extracted from the DCT domain   0.9348      0.9251   0.9049     handbag
Extracted from the DCT domain   0.9308      0.9425   0.9120     motorbike
Extracted from the DCT domain   0.9298      0.9347   0.9054     average
Features extracted from a CNN   0.6951      1        0.6951     cat
Features extracted from a CNN   0.7288      1        0.7288     airplane
Features extracted from a CNN   0.6935      1        0.6935     umbrella
Features extracted from a CNN   0.6821      1        0.6821     handbag
Features extracted from a CNN   0.6902      1        0.6902     motorbike
Features extracted from a CNN   0.6979      1        0.6979     average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.1: ROC-curves for the quality classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from the DCT domain have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the quality part when putting the entire system together.


4.2 Content classification

The evaluation of the content classification is done for each of the salient objects. For each salient object, a set of 920 images without modifications is used for evaluation. 460 of those images are salient, containing the salient object, and 460 are non-salient, containing random images from other categories. The number of images in the two categories is equal, which makes the values for precision, recall and accuracy easy to interpret. The guess of placing all images in one class would lead to an accuracy of 50% and one of the values for precision or recall to 100% and the other to 50%, depending on which class the images are placed in. The results of the content classification are shown in table 4.3. The results are visualized using ROC-curves, shown in figure 4.2. The ROC-curves show the relation between true positive rate (recall) and false positive rate.

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method       Precision   Recall   Accuracy   Salient object
HOG                             0.6631      0.6717   0.6652     cat
HOG                             0.8645      0.8043   0.8391     airplane
HOG                             0.5959      0.5739   0.5924     umbrella
HOG                             0.6759      0.6348   0.6652     handbag
HOG                             0.5758      0.7348   0.5967     motorbike
HOG                             0.6750      0.6839   0.6717     average
Extracted from the DCT domain   0.6253      0.6239   0.6250     cat
Extracted from the DCT domain   0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain   0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain   0.6256      0.5630   0.613      handbag
Extracted from the DCT domain   0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain   0.6559      0.6370   0.6441     average
Features extracted from a CNN   0.9038      0.7761   0.8467     cat
Features extracted from a CNN   1           0.6935   0.8467     airplane
Features extracted from a CNN   0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN   0.7560      0.6804   0.7304     handbag
Features extracted from a CNN   0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN   0.8799      0.7635   0.8256     average


(a) HOG features (b) Features extracted from the DCT domain (c) Features extracted from a CNN

Figure 4.2: ROC-curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives/all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN have the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat and their color coherence vectors are shown in figure 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a. Also shown is a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.
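A minimal sketch of this retrieval step is given below: images whose pairwise similarity exceeds the threshold are grouped greedily, and the highest-scoring image of each group is kept. The greedy single-link grouping, the function names and the similarity scale normalized to [0, 1] are assumptions made for illustration; the thesis text does not spell out the exact clustering procedure.

```python
def retrieve(images, similarity, scores, threshold=0.87):
    """Group images whose pairwise similarity exceeds the threshold (greedy,
    single link) and return the highest-scoring image from each group."""
    n = len(images)
    cluster_id = [-1] * n
    next_id = 0
    for i in range(n):
        if cluster_id[i] == -1:      # start a new cluster at the first unassigned image
            cluster_id[i] = next_id
            next_id += 1
        for j in range(i + 1, n):
            if cluster_id[j] == -1 and similarity[i][j] > threshold:
                cluster_id[j] = cluster_id[i]
    best = {}                        # cluster id -> index of the best image so far
    for idx, cid in enumerate(cluster_id):
        if cid not in best or scores[idx] > scores[best[cid]]:
            best[cid] = idx
    return [images[i] for i in best.values()]
```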


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster, with similarity 91.18%. Image (c) is not placed in the same cluster and has resulting similarities 32.46% to (a) and 32.06% to (b).


(a) Color coherence vector of image 4.3a (b) Color coherence vector of image 4.3b (c) Color coherence vector of image 4.3c

Figure 4.4: Color coherence vectors of the images in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, which is the number of coherent pixels for each color. The black bars represent β, which is the number of incoherent pixels for each color.
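For reference, a compact sketch of how such a color coherence vector can be computed is shown below. The color quantization (three levels per RGB channel) and the coherence size threshold tau are assumptions made for illustration; the parameters actually used are described in the theory and method chapters, not here.

```python
import numpy as np
from scipy.ndimage import label

def color_coherence_vector(image, levels=3, tau=300):
    """Compute (alpha, beta): per quantized color, the number of pixels in connected
    regions of at least tau pixels (coherent) and in smaller regions (incoherent)."""
    # Uniformly quantize each RGB channel into `levels` levels -> levels**3 indexed colors.
    quantized = (image // (256 // levels)).clip(max=levels - 1)
    index = quantized[..., 0] * levels ** 2 + quantized[..., 1] * levels + quantized[..., 2]
    alpha = np.zeros(levels ** 3, dtype=int)
    beta = np.zeros(levels ** 3, dtype=int)
    for color in range(levels ** 3):
        regions, n_regions = label(index == color)   # connected regions of this color
        for region in range(1, n_regions + 1):
            size = int(np.sum(regions == region))
            if size >= tau:
                alpha[color] += size
            else:
                beta[color] += size
    return alpha, beta
```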


(a) Resulting similarity matrix (b) Binary similarity matrix showing images that originate from the same image

Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measures between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.7782      0.9421   0.7806     cat
0.8071      0.8471   0.7611     airplane
0.7698      0.8843   0.7444     umbrella
0.7537      0.8471   0.7111     handbag
0.7935      0.9050   0.7778     motorbike
0.7805      0.8851   0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain; that is the feature extraction method which provided the best results when investigating the quality classification in section 4.1. The models used for the content classifications are the ones retrieved using features extracted from a CNN; that is the feature extraction method which provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated, compared to the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since half of the images are salient and some of them are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.
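The chaining of the three parts can be summarized as below, reusing the retrieve sketch from section 4.3. The predict/score methods and the additive combination of the two classifier scores are placeholders and assumptions, not an API from the thesis.

```python
def select_images(images, quality_model, content_model, pairwise_similarity):
    """Quality filter -> content filter -> similarity-based retrieval."""
    # Keep images classified as good; remember the classification score for ranking.
    good = [(img, quality_model.score(img)) for img in images
            if quality_model.predict(img) == "good"]
    # Keep the good images that also contain the salient object; accumulate scores.
    salient = [(img, q + content_model.score(img)) for img, q in good
               if content_model.predict(img) == "salient"]
    kept = [img for img, _ in salient]
    scores = [s for _, s in salient]
    # One image per cluster of near-duplicates, chosen by the combined score.
    return retrieve(kept, pairwise_similarity(kept), scores)
```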

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.5944      0.6813   0.8543     cat
0.6890      0.5117   0.8663     airplane
0.5055      0.6696   0.8168     umbrella
0.4717      0.5117   0.8027     handbag
0.6169      0.6404   0.8592     motorbike
0.5755      0.6029   0.8399     average

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results. Features extracted from the DCT domain give an average accuracy of 90.54%, compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values and that the recall is always 1 implies, from equations 3.1-3.3, that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method, but simply classifies all images as good. This can be seen in the ROC-curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is retrieved when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which it was trained to be able to provide separating features.

For HOG, the recall is overall very high and the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG. Noise and blur should affect the histograms of oriented gradients. Noise should lead to many small intense edges in spread directions. Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found.

Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC-curve in figure 4.1b. Ideal results are shown in a ROC-curve as following the left and the top borders; the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage units from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore, it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC-curves in figures 4.1b and 4.1c as the different colored curves being similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC-curve for HOG features in figure 4.1a, the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC-curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.1.
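The sweep behind such a curve can be sketched as follows, given the SVM decision scores and binary ground-truth labels (True for good). This is purely illustrative and mirrors what standard tooling such as scikit-learn's roc_curve computes.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for every possible
    decision threshold on the classifier scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for threshold in np.sort(np.unique(scores))[::-1]:  # strictest threshold first
        predicted_good = scores >= threshold
        tp = np.sum(predicted_good & labels)
        fp = np.sum(predicted_good & ~labels)
        tpr = tp / max(labels.sum(), 1)                  # recall
        fpr = fp / max((~labels).sum(), 1)
        points.append((fpr, tpr))
    return points
```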


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results. Features extracted from a CNN give an average accuracy of 82.56%, compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high and imply that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC-curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images in the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The bad performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since it uses very few features, describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since the model is able to differ somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of each image consists of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not cause a change in the color coherence vector as big as it would be if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous context, such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87% is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies with a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, and the proportion of wanted images being only 0.1859, together with the fact that most of the images should be removed during the classification steps, is a probable cause for the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for some of the content classification part and all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause for the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are thought to be equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified are correct. In many cases, an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient being retrieved from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of bigger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such, it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore, the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous context than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such output is not available for this study. Taking advantage of the results from such a system, the positions of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
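One possible form of this weighting is sketched below; the weight vector and the normalized-difference similarity are assumptions about how such a context down-weighting could look, not something specified in the thesis.

```python
import numpy as np

def weighted_ccv_similarity(ccv_a, ccv_b, weights):
    """Similarity between two color coherence vectors (alpha, beta), where each
    indexed color has a weight; expected context colors can be given low weights."""
    a = np.concatenate(ccv_a).astype(float)
    b = np.concatenate(ccv_b).astype(float)
    w = np.concatenate([weights, weights]).astype(float)
    # Normalized per-bin difference, turned into a weighted similarity in [0, 1].
    diff = np.abs(a - b) / (a + b + 1.0)
    return 1.0 - np.average(diff, weights=w)
```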

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies with data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237–1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5,793,888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based on histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1–5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253–256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65–73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.

[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206–212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600–612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

  • Abstract
  • Acknowledgments
  • Contents
  • Notation
  • 1 Introduction
    • 11 Motivation
    • 12 Aim
    • 13 Limitations
      • 2 Related theory
        • 21 Available data
        • 22 Machine learning
        • 23 Support Vector Machines
        • 24 Histogram of oriented gradients
        • 25 Features extracted from the discrete cosine transform domain
        • 26 Features extracted from a convolutional neural network
          • 261 Convolutional neural networks
          • 262 Extracting features from a pre-trained network
            • 27 Color coherence vector
              • 3 Method
                • 31 Feature extraction
                • 32 Predictor
                • 33 Similarity retrieval
                • 34 Evaluation
                • 35 Generation of training and evaluation data
                  • 4 Results
                    • 41 Quality classification
                    • 42 Content classification
                    • 43 Similarity retrieval
                    • 44 The entire system
                      • 5 Discussion
                        • 51 Results
                          • 511 Quality classification
                          • 512 Content classification
                          • 513 Similarity retrieval part
                          • 514 The entire system
                            • 52 Method
                            • 53 Possible improvements
                              • 6 Conclusions
                              • Bibliography
Page 30: Feature extraction for image selection using machine learning

35 Generation of training and evaluation data 23

(a) Original image (b) Brightened and Gaussian blurred

(c) Motion blurred (d) Darkened and added salt and pep-per noise

Figure 33 An image and examples of degraded versions of it the original is seenin (a) and the degraded versions are seen in (b) (c) and (d) The degraded imageshave been subjects to different degradation methods and have the same SSIM indexasymp 65

Each class is divided into a training part and an evaluation part The images aredivided into approximately 80 training data and 20 evaluation data The number oftraining images in the salient class is approximately 2000 but varies slightly dependingon which object is set to salient The number of training images in the non-salient classis approximately the same as the number of training images in the corresponding salientclass The number of images in the evaluation data set from the two classes are 920 forall different salient objects The number of images in the classes good and bad differsin both the training set and the evaluation set The quality training set consists of thecontent training set and modified versions of them and the quality evaluation set consistsof the content evaluation set and modified versions of them The good class consists of allimages in the salient and the non-salient class and the modified versions of them having

24 3 Method

an SSIM value above 65 The bad class consists of the modified versions of the imagesin the salient and non-salient class that have an SSIM value less than or equal to 65Therefore the number of bad images are always less than the number of good imagesThe modification is done randomly which means that the number of bad images variesdepending on what object is set to salient

The data is modified to fit the task also by creating images that are very similar toeach other That is done by applying one or more rigid transformations to an image andtherefore creating different versions of it That is done without changing the saliencyof the images meaning that the salient object is present in all versions of the imagesImages that originate from the same image are assumed to be similar and belong to thesame cluster Examples of images that are set to similar are shown in image 34 Allimages have been resized and cropped to obtain the size 500 times 500 pixels

Figure 34 Examples of similar images that originate from the same image andbelong to the same cluster

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 43 Results from the evaluation of the content classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 06631 06717 06652 catHOG 08645 08043 08391 airplaneHOG 05959 05739 05924 umbrellaHOG 06759 06348 06652 handbagHOG 05758 07348 05967 motorbikeHOG 06750 06839 06717 averageExtracted from the DCT domain 06253 06239 06250 catExtracted from the DCT domain 08182 06457 07511 airplaneExtracted from the DCT domain 06223 06196 06217 umbrellaExtracted from the DCT domain 06256 05630 0613 handbagExtracted from the DCT domain 05881 07326 06098 motorbikeExtracted from the DCT domain 06559 06370 06441 averageFeatures extracted from a CNN 09038 07761 08467 catFeatures extracted from a CNN 1 06935 08467 airplaneFeatures extracted from a CNN 08155 08457 08272 umbrellaFeatures extracted from a CNN 07560 06804 07304 handbagFeatures extracted from a CNN 09242 08217 08772 motorbikeFeatures extracted from a CNN 08799 07635 08256 average

42 Content classification 29

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 42 ROC-curves for the content classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from a CNN has the highest accuracy for all salient objects There-for this is the feature extraction method used for the content part when putting the entiresystem together

30 4 Results

43 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objectsFor each salient object a set of 360 salient images are used for evaluation 180 images areunique and 180 images belong to a cluster of similar images Each set contains 62 clustersof varying sizes with 2-6 images in each cluster The ideal output from the retrievalpart is one image from each cluster The scores that determine which image from eachcluster that should be retrieved are results of the classifications When investigating onlythe retrieval part the results from the classifications should not affect the outcome andtherefore all images are set to have the same score Hence the results of the evaluation ofthe retrieval depends solely on the clustering based on the similarity measures Examplesof images from the similarity retrieval with the salient object cat and their color coherencevectors are shown in figure 44 The similarity matrix containing the pairwise similaritymeasures of all images in the similarity set with the salient object cat is shown in figure45a Also shown is a binary similarity showing the true clusters as yellow in 45b Theresults from the retrieval part is shown in table 44

43 Similarity retrieval 31

(a) (b)

(c)

Figure 43 Examples of images that are clustered as similar and images that are notImages (a) and (b) are placed in the same similarity cluster with similarity 9118Image (c) is not placed in the same cluster and have resulting similarities 3246 to(a) and 3206 to (b)

32 4 Results

(a) Color coherence vector of image 43a

(b) Color coherence vector of image 43b

(c) Color coherence vector of image 43c

Figure 44 Color coherence vectors of images in figure 43 The x-axis are theindexed colors and the y-axis are the number of pixels in logarithmic scale The redbars represent α which is the number of coherent pixels for each color The blackbars represent β which is the number of incoherent pixels for each color

43 Similarity retrieval 33

(a) Resulting similarity matrix

(b) Binary similarity matrix showing images that originatefrom the same image

Figure 45 Matrices of pairwise similarity measures for the images in the similaritysub-set of the category cat (a) is the resulting similarity matrix and (b) is a binarymatrix showing the true similar as 1 and the rest as 0 Filling an entire similaritymatrix would mean calculating the similarity measures between two images twicewhich is avoided and results in upper triangular matrices

34 4 Results

Table 44 Results from the evaluation of the retrieval part for different categories assalient

Precision Recall Accuracy Salient object07782 09421 07806 cat08071 08471 07611 airplane07698 08843 07444 umbrella07537 08471 07111 handbag07935 09050 07778 motorbike07805 08851 07550 average

44 The entire system

The entire system is put together using the quality classification models retrieved usingfeatures extracted from the DCT domain It is the feature extraction method which pro-vided the best results when investigating the quality classification in section 41 Themodels used for the content classifications are the ones retrieved using features extractedfrom a CNN It is the feature extraction method which provided the best results wheninvestigating the content classification in section 42 The evaluation of the entire systemis done for each of the salient objects The evaluation is performed on the same sets as theevaluation of the quality classification which contains the evaluation sets from the contentclassification and the similarity retrieval The output from the quality classification is in-put to the content classification and the output from the content classification is input tothe similarity retrieval part The results from the similarity retrieval part are the imagesthat are evaluated compared to the images which are wanted The images that are wantedare the ones which are actually good salient unique and best from its cluster There arefewer images that are wanted than images that are not since half of the images are salientand some of them are almost duplicates andor bad There are 342 wanted images out ofthe total 1840 images which makes the proportion of wanted images 01859 The resultsof how the entire system works together is seen in table 45

Table 45 Results from the evaluation of the entire system for different categoriesas salient

Precision Recall Accuracy Salient object05944 06813 08543 cat06890 05117 08663 airplane05055 06696 08168 umbrella04717 05117 08027 handbag06169 06404 08592 motorbike05755 06029 08399 average

5Discussion

51 Results

511 Quality classification

The evaluation of the quality classification shows that features extracted from the DCTdomain gives the best results Features extracted from the DCT domain gives an averageaccuracy of 9054 compared to 8357 for HOG and 6979 for features extracted froma CNN When taking the proportion of good images into account it appears that the ac-curacy values for features from a CNN matches the proportion values exactly The factthat the precision values for the method also follows the proportion values and that therecall is always 1 implies from equations 31-33 that there are no true negatives or falsenegatives The SVM was not able to create a good classification model using this methodbut simply classifies all images as good This can be seen in the ROC-curve in figure 41cwhere all curves are very close to where the true positive rate equals the false positiverate which is retrieved when placing all images in one class when the proportion of goodimages is 05 The slight differences are due to the proportion of good images not being05 and small variations in the retrieved scores although all scores are above the thresholdfor being good The method of using features extracted from a CNN was chosen becauseof its ability of performing well on new data sets however this task may differ too muchfrom the task for which it was trained to be able to provide separating features For HOGthe recall is overall very high and the precision is lower and almost equal to the accuracywhich implies that most images are classified as good with quite high number of false pos-itives So although it actually finds a classification model it is not a very good one HOGis often used for object detection where it often is desired to disregard quality parameterssuch as lightning and blur Therefore it is no surprise that it does not lead to great resultwhen investigating quality Since gradients describe difference in intensity darkening orbrightening entire images should not change the gradients unless edges disappear andthe histograms of oriented gradients are normalized which can explain why modifications

35

36 5 Discussion

in lightning are hard to detect using HOG Noise and blur should affect the histogramsof oriented gradients Noise should lead to many small intense edges in spread direc-tions Gaussian blur should lead to fewer and weaker edges and motion blur should leadto fewer and weaker edges along the moving direction and many short edges orthogonalto the moving direction However no connection between modification types and imagesthat are classified as bad is found Features extracted from the DCT domain result in goodvalues for precision recall and accuracy which shows that the SVM was able to find agood classification model This is also seen in the ROC-curve in figure 41b Ideal resultsare shown in a ROC-curve as following the left and the top borders the results from fea-tures extracted from the DCT domain are quite close to that appearance The features wereextracted to describe quality parameters in images which makes it reasonable to find thatthat method gives the best result when investigating quality Its features describe smooth-ness texture and edge information which should be affected by noise and blur None ofthem should however be directly affected by different lightning conditions Despite thatno connection between modification type and images that are falsely classified is found

Although the proportion of good images varies slightly between the different salientobjects it is at most 309 percentage units from the mean value The variation in accuracyvalues for the different sets of salient objects overall matches the variation in proportionin good images meaning that the salient objects with slightly higher proportion of goodimages also have slightly higher accuracy Therefore it is possible to interpret the resultsfrom the quality classification as being general and not varying remarkable with the dif-ferent salient objects This can be seen in the ROC-curves in figure 41b and 41c as thedifferent colored curves being similar the difference in proportion of good between thedifferent salient objects however causes slight variations In the ROC-curve for HOG fea-tures in figure 41a the curves are not very similar which is partly because the differentproportions of good images but mostly because it does not provide a good quality classi-fication model HOG provides a poor classification model from which the results variesbetween the different salient objects

The number of good and bad training images varies with the salient object Partlybecause the modification is done randomly but also because the number of images be-ing modified varies The largest good class consists of 6588 images and the smallest4817 Although the number of training observations for each salient object is quite largethe variation may impact the capacity of the resulting quality classification models Thesmall variations in the quality classification results is however more likely caused by thedifferent context in the images

The ROC-curves describe the trade-off between the true positive rate and the falsepositive rate which is basically two different types of errors letting too many imagespass as good or finding too few good images Following a curve gives the resulting truepositive rate and false positive rate when changing how tolerant or strict the threshold forclassifying images as good is In this case where one class is retained and the other is notit might be more important not to discard too many good images than to discard all badimages Then the threshold can be changed and the new rates can be retrieved from theROC-curves in figure 41

51 Results 37

512 Content classification

The evaluation of the content classification shows that features extracted from a CNN givesthe best results Features extracted from a CNN gives an average accuracy of 8256 com-pared to 6717 for HOG and 6441 for features extracted from the DCT domain Theaccuracy values have variances 3155 for features extracted from a CNN 10005 forHOG and 6571 for features extracted from the DCT domain Those numbers are allquite high and implies that the content classification is not general and varies significantlywith the different salient objects That can also be seen in the ROC-curves in figure 42as the different colored curves representing different salient objects are differing Figure42b which shows the results from using features extracted from the DCT domain showsthat the curves for the different salient objects are quite similar except for the categoryairplane All curves are rather close to the line where the true positive rate equals thefalse positive rate except for airplane Being close to that line for this case where each ofthe two classes contain half of the images corresponds to simply classifying all imagesin the same class That means that the category airplane is the only one for which a de-cent classification model is retrieved The bad performance of features extracted from theDCT domain for content classification for the majority of the different salient objects isnot astonishing since it uses very few features describing statistics in images associatedwith quality The decent result for the category airplane however is more astonishingsince it is able to differ somewhat between salient and non-salient images only describedby smoothness texture and edge information Features extracted from a CNN are trainedon a large set of images for an object classification task The task is similar to this con-tent classification and the features seem to fulfill their purpose of performing well whenapplied to new data sets HOG are often used for content classification tasks and perform-ing well However this shallow feature extraction method is outperformed by featuresextracted from a deep architecture

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are, however, more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. The threshold can then be changed and the new rates retrieved from the ROC-curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of the images consist mostly of context and the color coherence vectors are calculated over the entire images. Applying a transformation to an image with a homogeneous background, still having the salient object present, does not change the color coherence vector as much as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts, such as indoor environments, roads and sky.
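To make the role of the background concrete, the sketch below compares two color coherence vectors with a normalized L1-style measure. The exact similarity measure and scaling used in the thesis may differ, so both the formula and the example counts are assumptions for illustration only.

import numpy as np

def ccv_similarity(alpha1, beta1, alpha2, beta2):
    # alpha*: coherent pixel counts per indexed color, beta*: incoherent counts.
    # Returns 1.0 for identical vectors and lower values for dissimilar ones.
    diff = np.abs(alpha1 - alpha2).sum() + np.abs(beta1 - beta2).sum()
    total = (alpha1 + beta1).sum() + (alpha2 + beta2).sum()
    return 1.0 - diff / total

# Two views of the same scene where only the small salient object has shifted
# between color bins: the dominant background color (index 0) keeps the
# similarity high.
a1 = np.array([9000.0, 500.0, 300.0]); b1 = np.array([100.0, 50.0, 50.0])
a2 = np.array([9000.0, 300.0, 500.0]); b2 = np.array([100.0, 50.0, 50.0])
print(ccv_similarity(a1, b1, a2, b2))   # 0.98, despite the object changing

If the background color instead changes between the two images, the large counts in the first bin no longer cancel and the measure drops sharply, which matches the behavior described above.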

The similarity threshold was chosen from testing because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore, the results are very dependent on which threshold value is set. The value 87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects, and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps together explain the high number of true negatives. For all sets, most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall; this explains why the accuracy is considerably higher than the precision and recall. The accuracy values are also higher than some of the accuracy values for the content classification part and all of those for the similarity retrieval part evaluated separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more source of error when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only retrieved images that are unmodified count as correct. In many cases an image retrieved from a cluster is modified, has an SSIM slightly lower than 1, and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image with, for example, SSIM = 0.99 a lower quality score than an image with SSIM = 1. Accepting any image that is both good and salient as the retrieval from each cluster would probably increase the precision and recall values.
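The effect of the true negatives can be made concrete with approximate counts derived from the stated figures: 1840 images per evaluation set, a wanted proportion of 0.1859, and average precision and recall around 0.58 and 0.60. The counts below are back-of-the-envelope values, not numbers taken from the thesis tables.

# Approximate confusion-matrix counts for one evaluation set of 1840 images.
tp = 206                     # ~0.60 * 342 wanted images correctly retrieved
fn = 342 - tp                # wanted images that were missed
fp = 152                     # unwanted images that slipped through
tn = 1840 - tp - fn - fp     # everything else correctly rejected (1346 images)

precision = tp / (tp + fp)   # ~0.58
recall = tp / (tp + fn)      # ~0.60
accuracy = (tp + tn) / 1840  # ~0.84
print(precision, recall, accuracy)

Because the 1346 true negatives enter only the accuracy, the accuracy ends up far above the precision and recall even though the retrieval of wanted images is mediocre.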


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent, a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but it is rarely the case that one method always outperforms the others; instead, performance varies greatly with data sets and tasks. Therefore, the biggest remark regarding the chosen methods concerns the data set. The data set used in this investigation is an example data set, which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary greatly, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods: airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airport runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies greatly, and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If one knows more about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context typically prevails during a flight, that information can be used to improve the similarity retrieval part. The color coherence vectors can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure that is less dependent on the context. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the positions of objects can be found in the images. That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
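A sketch of such a weighting is given below; the weighting scheme, the variable names and the underlying similarity form are illustrative assumptions and not part of the thesis implementation.

import numpy as np

def weighted_ccv_similarity(alpha1, beta1, alpha2, beta2, context_weight):
    # context_weight[i] close to 0 down-weights colors expected to dominate the
    # flight context (for example sky or water tones); 1.0 keeps full influence.
    diff = context_weight * (np.abs(alpha1 - alpha2) + np.abs(beta1 - beta2))
    total = context_weight * (alpha1 + beta1 + alpha2 + beta2)
    return 1.0 - diff.sum() / total.sum()

With all weights equal to one, the measure reduces to the unweighted comparison, so the weighting only suppresses the influence of colors known to belong to the expected background.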

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.
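As a rough indication of what such training could look like, the sketch below replaces the classification head of an ImageNet-pretrained network and fine-tunes it for the binary salient/non-salient task. The framework and model choice (PyTorch, ResNet-18) are arbitrary illustrative choices and not the network or tools used in the thesis.

import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and replace the classification
# head with a binary salient / non-salient output.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    # One fine-tuning step on a batch of preprocessed images and 0/1 labels.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()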

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can have fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets with different salient objects, it is quite likely that it varies with data sets as well. The data set used differs greatly from the data sets for which the system is intended. A data set containing automatically recorded flight data does not, to the same extent, have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


Bibliography

[1] Convolutional neural networks (LeNet). URL http://deeplearning.net/tutorial/lenet.html. Cited on page 15.

[2] B.H. Boyle. Support Vector Machines: Data Analysis, Machine Learning and Applications. Computer science, technology and applications. Nova Science Publishers, 2011. ISBN 9781612093420. URL https://books.google.co.uk/books?id=T7tAYgEACAAJ. Cited on page 7.

[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Cited on pages 15 and 18.

[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1237-1242. AAAI Press, 2011. ISBN 978-1-57735-514-4. doi: 10.5591/978-1-57735-516-8/IJCAI11-210. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-210. Cited on page 13.

[5] R.L. Delanoy. Machine learning apparatus and method for image searching, August 11 1998. URL https://www.google.com/patents/US5793888. US Patent 5793888. Cited on page 1.

[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013. URL http://arxiv.org/abs/1310.1531. Cited on page 15.

[7] Eren Golge. How does feature extraction work on images? URL https://www.quora.com/profile/Eren-Golge/Machine-Learning/How-does-feature-extraction-work-on-images. Cited on page 5.

[8] L. Greche and N. Es-Sbai. Automatic system for facial expression recognition based histogram of oriented gradient and normalized cross correlation. In 2016 International Conference on Information Technology for Organizations Development (IT4OD), pages 1-5, March 2016. doi: 10.1109/IT4OD.2016.7479316. Cited on page 9.

[9] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In ISCAS, pages 253-256. IEEE, 2010. ISBN 978-1-4244-5309-2. URL http://dblp.uni-trier.de/db/conf/iscas/iscas2010.html#LeCunKF10. Cited on page 15.

[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312. Cited on page 3.

[11] MathWorks. Support vector machines for binary classification. URL https://se.mathworks.com/help/stats/support-vector-machines-for-binary-classification.html. Cited on pages 6, 7, and 19.

[12] MathWorks. extractHOGFeatures. URL https://se.mathworks.com/help/vision/ref/extracthogfeatures.html. Cited on page 9.

[13] MathWorks. Discrete cosine transform. URL https://se.mathworks.com/help/images/discrete-cosine-transform.html. Cited on page 10.

[14] MathWorks. Supervised learning workflow and algorithms. URL https://se.mathworks.com/help/stats/supervised-learning-machine-learning-workflow-and-algorithms.html?s_tid=conf_addres_DA_eb. Cited on page 5.

[15] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015. Cited on page 14.

[16] Parul Parashar and Er. Harish Kundra. Comparison of various image classification methods. International Journal of Advances in Science and Technology (IJAST), 2(1), 2014. Cited on page 19.

[17] Greg Pass, Ramin Zabih, and Justin Miller. Comparing images using color coherence vectors. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 65-73, New York, NY, USA, 1996. ACM. ISBN 0-89791-871-1. doi: 10.1145/244130.244148. URL http://doi.acm.org/10.1145/244130.244148. Cited on pages 16 and 19.

[18] Srini Penchikala. Big data processing with Apache Spark - part 4: Spark machine learning, May 2016. URL https://www.infoq.com/articles/apache-spark-machine-learning. Cited on page 4.

[19] M.A. Saad, A.C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8), August 2008. Cited on pages 10, 11, and 19.


[20] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Pedestrian detection using infrared images and histograms of oriented gradients. In IEEE Conference on Intelligent Vehicles, pages 206-212, 2006. Cited on pages 9, 18, and 19.

[21] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Trans. Img. Proc., 13(4):600-612, April 2004. ISSN 1057-7149. doi: 10.1109/TIP.2003.819861. URL http://dx.doi.org/10.1109/TIP.2003.819861. Cited on pages 18 and 22.

Page 32: Feature extraction for image selection using machine learning

4Results

41 Quality classification

The evaluation of the quality classification is done for each of the salient objects Foreach salient object a set of 1840 images is used for evaluation Each set consists of bothsalient and non-salient images 920 images have been modified randomly as describedin section 35 and 920 images have not The images that have an SSIM value above 65should be classified as bad and the rest as good Since the degradation is done randomlythe number of good and bad images in the evaluation set varies with the salient objectsThe number of images in the good class is always larger than the number of images inthe bad class and therefore classifying all images as good gives a recall value of 100a precision value same as the classification accuracy which is equal to the proportion ofgood images If the difference in number of images in the two classes is large enoughclassifying all images as good might lead to a false perception of good results Thereforethe proportion of good images needs to be considered when interpreting the results Theproportion of good images for the different salient objects is shown in table 41 Theresults of the quality classification are shown in table 42 The results are visualized usingreceiver operating characteristic (ROC) curves shown in figure 41 The ROC-curves showsthe relation between true positive rate (recall) and true negative rate

Table 41 The proportion of good images for the different salient objects

Proportion good images Salient object06951 cat07288 airplane06935 umbrella06821 handbag06902 motorbike

25

26 4 Results

Table 42 Results from the evaluation of the quality classification for the differentfeature extraction methods and for different categories as salient

Feature extraction method Precision Recall Accuracy Salient objectHOG 08399 0939 08332 catHOG 08544 09799 08636 airplaneHOG 08018 09702 0813 umbrellaHOG 08333 09442 08332 handbagHOG 08506 09236 08353 motorbikeHOG 08360 09514 08357 averageExtracted from the DCT domain 09196 09116 08832 catExtracted from the DCT domain 09292 09500 09109 airplaneExtracted from the DCT domain 09348 09444 09158 umbrellaExtracted from the DCT domain 09348 09251 09049 handbagExtracted from the DCT domain 09308 09425 09120 motorbikeExtracted from the DCT domain 09298 09347 09054 averageFeatures extracted from a CNN 06951 1 06951 catFeatures extracted from a CNN 07288 1 07288 airplaneFeatures extracted from a CNN 06935 1 06935 umbrellaFeatures extracted from a CNN 06821 1 06821 handbagFeatures extracted from a CNN 06902 1 06902 motorbikeFeatures extracted from a CNN 06979 1 06979 average

41 Quality classification 27

(a) HOG features (b) Features extracted from the DCT do-main

(c) Features extracted from a CNN

Figure 41 ROC-curves for the quality classifications The curves show the rela-tion between true positive rate (recall) and false positive rate (false positivesall neg-atives) (a) shows the results from using HOG features (b) shows the results fromusing features extracted from the DCT domain and (c) shows the results from usingfeatures extracted from a CNN The different salient objects are shown as differentcolors

Features extracted from the DCT domain has the highest accuracy for all salient ob-jects Therefor this is the feature extraction method used for the quality part when puttingthe entire system together

28 4 Results

42 Content classification

The evaluation of the content classification is done for each of the salient objects For eachsalient object a set of 920 images without modifications is used for evaluation 460 ofthose images are salient containing the salient object and 460 are non-salient containingrandom images from other categories The number of images in the two categories areequal which makes the values for precision recall and accuracy easy to interpret Theguess of placing all images in one class would lead to an accuracy of 50 and one of thevalues for precision or recall to 100 and the other to 50 depending on which class theimages are placed in The results of the content classification are shown in table 43 Theresults are visualized using ROC-curves shown in figure 42 The ROC-curves shows therelation between true positive rate (recall) and false positive rate

Table 4.3: Results from the evaluation of the content classification for the different feature extraction methods and for different categories as salient.

Feature extraction method        Precision   Recall   Accuracy   Salient object
HOG                              0.6631      0.6717   0.6652     cat
HOG                              0.8645      0.8043   0.8391     airplane
HOG                              0.5959      0.5739   0.5924     umbrella
HOG                              0.6759      0.6348   0.6652     handbag
HOG                              0.5758      0.7348   0.5967     motorbike
HOG                              0.6750      0.6839   0.6717     average
Extracted from the DCT domain    0.6253      0.6239   0.6250     cat
Extracted from the DCT domain    0.8182      0.6457   0.7511     airplane
Extracted from the DCT domain    0.6223      0.6196   0.6217     umbrella
Extracted from the DCT domain    0.6256      0.5630   0.613      handbag
Extracted from the DCT domain    0.5881      0.7326   0.6098     motorbike
Extracted from the DCT domain    0.6559      0.6370   0.6441     average
Features extracted from a CNN    0.9038      0.7761   0.8467     cat
Features extracted from a CNN    1           0.6935   0.8467     airplane
Features extracted from a CNN    0.8155      0.8457   0.8272     umbrella
Features extracted from a CNN    0.7560      0.6804   0.7304     handbag
Features extracted from a CNN    0.9242      0.8217   0.8772     motorbike
Features extracted from a CNN    0.8799      0.7635   0.8256     average


Figure 4.2: ROC curves for the content classifications. The curves show the relation between true positive rate (recall) and false positive rate (false positives / all negatives). (a) shows the results from using HOG features, (b) shows the results from using features extracted from the DCT domain and (c) shows the results from using features extracted from a CNN. The different salient objects are shown as different colors.

Features extracted from a CNN give the highest accuracy for all salient objects. Therefore, this is the feature extraction method used for the content part when putting the entire system together.


4.3 Similarity retrieval

The evaluation of the retrieval part of the system is done for each of the salient objects. For each salient object, a set of 360 salient images is used for evaluation; 180 images are unique and 180 images belong to a cluster of similar images. Each set contains 62 clusters of varying sizes, with 2-6 images in each cluster. The ideal output from the retrieval part is one image from each cluster. The scores that determine which image from each cluster should be retrieved are results of the classifications. When investigating only the retrieval part, the results from the classifications should not affect the outcome, and therefore all images are set to have the same score. Hence, the results of the evaluation of the retrieval depend solely on the clustering based on the similarity measures. Examples of images from the similarity retrieval with the salient object cat, and their color coherence vectors, are shown in figures 4.3 and 4.4. The similarity matrix containing the pairwise similarity measures of all images in the similarity set with the salient object cat is shown in figure 4.5a, together with a binary similarity matrix showing the true clusters as yellow in figure 4.5b. The results from the retrieval part are shown in table 4.4.
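A minimal sketch of the retrieval step as it is evaluated here is given below. It assumes that the pairwise similarities are stored in an upper triangular matrix S, that images whose similarity exceeds the threshold are grouped transitively into clusters, and that the image with the highest classification score is kept from each cluster; the exact grouping used in the implementation may differ.

% Minimal sketch (assumed implementation, not the thesis code) of the retrieval step.
% S is an n-by-n upper triangular matrix of pairwise similarities in [0, 1];
% scores holds the combined classification scores (set to a constant in this evaluation).
T = 0.87;                            % similarity threshold used in the evaluation
A = (S + S') > T;                    % symmetrise the upper triangular matrix
A(1:size(A,1)+1:end) = 0;            % ignore self-similarity
bins = conncomp(graph(A));           % clusters of mutually similar images

retrieved = zeros(1, max(bins));
for c = 1:max(bins)
    members = find(bins == c);
    [~, best] = max(scores(members));    % keep the highest-scoring image in the cluster
    retrieved(c) = members(best);
end

Unique images form singleton clusters and are therefore always retrieved, which matches the ideal output of one image per cluster.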


Figure 4.3: Examples of images that are clustered as similar and images that are not. Images (a) and (b) are placed in the same similarity cluster with similarity 0.9118. Image (c) is not placed in the same cluster and has resulting similarities 0.3246 to (a) and 0.3206 to (b).


Figure 4.4: Color coherence vectors of the images (a), (b) and (c) in figure 4.3. The x-axis shows the indexed colors and the y-axis the number of pixels in logarithmic scale. The red bars represent α, the number of coherent pixels for each color. The black bars represent β, the number of incoherent pixels for each color.


Figure 4.5: Matrices of pairwise similarity measures for the images in the similarity sub-set of the category cat. (a) is the resulting similarity matrix and (b) is a binary matrix showing the true similar pairs as 1 and the rest as 0. Filling an entire similarity matrix would mean calculating the similarity measure between two images twice, which is avoided and results in upper triangular matrices.


Table 4.4: Results from the evaluation of the retrieval part for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.7782      0.9421   0.7806     cat
0.8071      0.8471   0.7611     airplane
0.7698      0.8843   0.7444     umbrella
0.7537      0.8471   0.7111     handbag
0.7935      0.9050   0.7778     motorbike
0.7805      0.8851   0.7550     average

4.4 The entire system

The entire system is put together using the quality classification models retrieved using features extracted from the DCT domain, which is the feature extraction method that provided the best results when investigating the quality classification in section 4.1. The models used for the content classification are the ones retrieved using features extracted from a CNN, which is the feature extraction method that provided the best results when investigating the content classification in section 4.2. The evaluation of the entire system is done for each of the salient objects. The evaluation is performed on the same sets as the evaluation of the quality classification, which contain the evaluation sets from the content classification and the similarity retrieval. The output from the quality classification is input to the content classification, and the output from the content classification is input to the similarity retrieval part. The results from the similarity retrieval part are the images that are evaluated against the images which are wanted. The images that are wanted are the ones which are actually good, salient, unique and best from their cluster. There are fewer images that are wanted than images that are not, since only half of the images are salient and some of those are almost duplicates and/or bad. There are 342 wanted images out of the total 1840 images, which makes the proportion of wanted images 0.1859. The results of how the entire system works together are seen in table 4.5.

Table 4.5: Results from the evaluation of the entire system for different categories as salient.

Precision   Recall   Accuracy   Salient object
0.5944      0.6813   0.8543     cat
0.6890      0.5117   0.8663     airplane
0.5055      0.6696   0.8168     umbrella
0.4717      0.5117   0.8027     handbag
0.6169      0.6404   0.8592     motorbike
0.5755      0.6029   0.8399     average
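For clarity, the composition of the evaluated system can be summarized as in the following sketch, where predictQuality, predictContent and retrieveUnique are placeholder names for the trained DCT-based quality SVM, the CNN-feature content SVM and the similarity retrieval step; they are not functions from the actual implementation, and the way the two classification scores are combined here is only an assumption.

% Minimal sketch of how the three parts are chained in this evaluation.
[qualityLabel, qualityScore] = predictQuality(images);      % good / bad
good      = images(qualityLabel == 1);
goodScore = qualityScore(qualityLabel == 1);

[contentLabel, contentScore] = predictContent(good);        % salient / non-salient
salient      = good(contentLabel == 1);
salientScore = goodScore(contentLabel == 1) + contentScore(contentLabel == 1);

selected = retrieveUnique(salient, salientScore);           % one image per similarity cluster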

5 Discussion

5.1 Results

5.1.1 Quality classification

The evaluation of the quality classification shows that features extracted from the DCT domain give the best results, with an average accuracy of 90.54% compared to 83.57% for HOG and 69.79% for features extracted from a CNN. When taking the proportion of good images into account, it appears that the accuracy values for features from a CNN match the proportion values exactly. The fact that the precision values for the method also follow the proportion values, and that the recall is always 1, implies from equations 3.1-3.3 that there are no true negatives or false negatives. The SVM was not able to create a good classification model using this method but simply classifies all images as good. This can be seen in the ROC curve in figure 4.1c, where all curves are very close to the line where the true positive rate equals the false positive rate, which is what is obtained when placing all images in one class and the proportion of good images is 0.5. The slight differences are due to the proportion of good images not being 0.5 and to small variations in the retrieved scores, although all scores are above the threshold for being good. The method of using features extracted from a CNN was chosen because of its ability to perform well on new data sets; however, this task may differ too much from the task for which the network was trained for it to provide separating features. For HOG the recall is overall very high while the precision is lower and almost equal to the accuracy, which implies that most images are classified as good, with a quite high number of false positives. So although it actually finds a classification model, it is not a very good one. HOG is often used for object detection, where it is often desired to disregard quality parameters such as lighting and blur. Therefore it is no surprise that it does not lead to great results when investigating quality. Since gradients describe differences in intensity, darkening or brightening entire images should not change the gradients unless edges disappear, and the histograms of oriented gradients are normalized, which can explain why modifications in lighting are hard to detect using HOG.



Noise and blur should affect the histograms of oriented gradients: noise should lead to many small intense edges in spread directions, Gaussian blur should lead to fewer and weaker edges, and motion blur should lead to fewer and weaker edges along the moving direction and many short edges orthogonal to the moving direction. However, no connection between modification types and images that are classified as bad is found. Features extracted from the DCT domain result in good values for precision, recall and accuracy, which shows that the SVM was able to find a good classification model. This is also seen in the ROC curve in figure 4.1b. Ideal results are shown in a ROC curve as following the left and the top borders, and the results from features extracted from the DCT domain are quite close to that appearance. The features were extracted to describe quality parameters in images, which makes it reasonable to find that this method gives the best result when investigating quality. Its features describe smoothness, texture and edge information, which should be affected by noise and blur. None of them should however be directly affected by different lighting conditions. Despite that, no connection between modification type and images that are falsely classified is found.
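The reasoning above about lighting and blur can be illustrated with a small experiment of the following kind (an illustrative sketch, not part of the thesis work): a globally darkened copy of an image gives a nearly unchanged HOG descriptor because of the block normalization, whereas a blurred copy changes it clearly.

% Minimal sketch: HOG is largely invariant to global intensity scaling but not to blur.
gray    = rgb2gray(im2double(imread('peppers.png')));   % any example image
dark    = 0.5 * gray;                                    % globally darkened copy
blurred = imgaussfilt(gray, 3);                          % Gaussian blur

hogOrig = extractHOGFeatures(gray);
hogDark = extractHOGFeatures(dark);
hogBlur = extractHOGFeatures(blurred);

norm(hogOrig - hogDark)   % near zero: normalisation cancels the intensity scaling
norm(hogOrig - hogBlur)   % clearly larger: blur weakens and removes edges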

Although the proportion of good images varies slightly between the different salient objects, it is at most 3.09 percentage points from the mean value. The variation in accuracy values for the different sets of salient objects overall matches the variation in proportion of good images, meaning that the salient objects with slightly higher proportion of good images also have slightly higher accuracy. Therefore it is possible to interpret the results from the quality classification as being general and not varying remarkably with the different salient objects. This can be seen in the ROC curves in figures 4.1b and 4.1c, where the different colored curves are similar; the difference in proportion of good images between the different salient objects however causes slight variations. In the ROC curve for HOG features in figure 4.1a the curves are not very similar, which is partly because of the different proportions of good images, but mostly because HOG does not provide a good quality classification model. HOG provides a poor classification model, from which the results vary between the different salient objects.

The number of good and bad training images varies with the salient object, partly because the modification is done randomly, but also because the number of images being modified varies. The largest good class consists of 6588 images and the smallest of 4817. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting quality classification models. The small variations in the quality classification results are however more likely caused by the different context in the images.

The ROC curves describe the trade-off between the true positive rate and the false positive rate, which corresponds to two different types of errors: letting too many images pass as good, or finding too few good images. Following a curve gives the resulting true positive rate and false positive rate when changing how tolerant or strict the threshold for classifying images as good is. In this case, where one class is retained and the other is not, it might be more important not to discard too many good images than to discard all bad images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.1.
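A ROC curve of the kind shown in figure 4.1 can be traced by sweeping the decision threshold over the classifier scores, as in the sketch below (variable names are illustrative; MATLAB's perfcurve performs an equivalent sweep).

% Minimal sketch: tracing a ROC curve by sweeping the score threshold.
% scores are the SVM scores and isGood the true binary labels.
thresholds = sort(unique(scores), 'descend');
tpr = zeros(size(thresholds));
fpr = zeros(size(thresholds));
for i = 1:numel(thresholds)
    predictedGood = scores >= thresholds(i);
    tpr(i) = sum(predictedGood &  isGood) / sum(isGood);    % true positive rate (recall)
    fpr(i) = sum(predictedGood & ~isGood) / sum(~isGood);   % false positive rate
end
plot(fpr, tpr)
xlabel('False positive rate'), ylabel('True positive rate (recall)')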


5.1.2 Content classification

The evaluation of the content classification shows that features extracted from a CNN give the best results, with an average accuracy of 82.56% compared to 67.17% for HOG and 64.41% for features extracted from the DCT domain. The accuracy values have variances 31.55 for features extracted from a CNN, 100.05 for HOG and 65.71 for features extracted from the DCT domain. Those numbers are all quite high, which implies that the content classification is not general and varies significantly with the different salient objects. That can also be seen in the ROC curves in figure 4.2, as the different colored curves representing different salient objects differ. Figure 4.2b, which shows the results from using features extracted from the DCT domain, shows that the curves for the different salient objects are quite similar except for the category airplane. All curves are rather close to the line where the true positive rate equals the false positive rate, except for airplane. Being close to that line in this case, where each of the two classes contains half of the images, corresponds to simply classifying all images into the same class. That means that the category airplane is the only one for which a decent classification model is retrieved. The poor performance of features extracted from the DCT domain for content classification for the majority of the different salient objects is not astonishing, since the method uses very few features, describing statistics in images associated with quality. The decent result for the category airplane however is more astonishing, since the method is able to differentiate somewhat between salient and non-salient images described only by smoothness, texture and edge information. Features extracted from a CNN are trained on a large set of images for an object classification task. That task is similar to this content classification, and the features seem to fulfill their purpose of performing well when applied to new data sets. HOG is often used for content classification tasks and usually performs well. However, this shallow feature extraction method is outperformed by features extracted from a deep architecture.

The number of salient and non-salient training images is approximately 2000 for each salient object, but it varies slightly. The largest salient class consists of 2418 images and the smallest of 1700. Although the number of training observations for each salient object is quite large, the variation may impact the capacity of the resulting content classification models. The variations in the content classification results are however more likely caused by the different content in the images.

As described for the quality classification in section 5.1.1, one type of error may be preferred over the other. In this case, where one class is retained and the other is not, it might be more important not to discard too many salient images than to discard all non-salient images. Then the threshold can be changed and the new rates can be retrieved from the ROC curves in figure 4.2.

5.1.3 Similarity retrieval part

The similarity retrieval part gets an average accuracy of 75.50%, with the best result being 78.06% and the worst 71.11%. The result varies by a few percentage points between the different salient objects and the variance in accuracy is 8.13. That is most likely caused by the context of the salient objects rather than the objects themselves, because the majority of each image consists mostly of context and the color coherence vectors are calculated over the entire images.


Applying a transformation to an image with a homogeneous background, while keeping the salient object present, does not cause a change in the color coherence vector as big as it would if the background were changing. This might explain why the two sets with the lowest resulting accuracy have the salient objects handbag and umbrella, which are typically found in varying contexts such as crowds of people. The sets with the salient objects cat, motorbike and airplane have the best resulting accuracy. Those salient objects are often found in relatively homogeneous contexts such as indoor environments, roads and sky.

The similarity threshold was chosen from testing, because it gave the best resulting accuracy on average for the different salient objects. As shown in the resulting similarity matrix for the sub-set of the category cat in figure 4.5, the resulting similarity values are dispersed across the spectrum. Therefore the results are very dependent on which threshold value is set. The value 0.87 is quite high, which is why the recall value is in every case higher than the precision value. In this case, where almost-duplicates are removed, that means rather keeping a few similar images than risking the removal of unique images.

5.1.4 The entire system

The evaluation of the entire system gives an average accuracy of 83.99%, with the best result being 86.63% and the worst 80.27%. The result varies by a few percentage points between the different salient objects and the variance in accuracy is 7.99. The classifications both have overall high precision values, which means that they do not falsely classify many images as good or salient. That, the proportion of wanted images being only 0.1859, and the fact that most of the images should be removed during the classification steps are probable causes for the high number of true negatives. For all sets most of the correct classifications are true negatives, which, as shown in equations 3.1-3.3, affects the accuracy but not the precision and recall, and which explains why the accuracy is severely higher than the precision and recall. The accuracy values are also higher than the accuracy values for parts of the content classification and for all of the similarity retrieval part separately. That is also most likely caused by the high number of true negatives when evaluating the entire system. The variance in accuracy being lower for the entire system than for the separate parts is probably another consequence of the high number of true negatives. One cause of the overall low precision and recall is that the similarity retrieval part introduces one more error source when the system is put together. The image that is retrieved from each cluster is the one with the highest score from the classifications. All images in a cluster are considered equally salient, since they all contain the salient object. The quality of the images is decided based on the SSIM values, and since unmodified images have SSIM = 1, only unmodified retrieved images are counted as correct. In many cases an image retrieved from a cluster is modified to have SSIM slightly lower than 1 and is therefore counted as falsely classified. Although the quality classification scores lead to good classification results, they might not correlate well enough to give an image of, for example, SSIM = 0.99 a lower quality score than an image of SSIM = 1. Accepting any image that is both good and salient from each cluster would probably increase the precision and recall values.


5.2 Method

The biggest weakness in the system is the similarity retrieval part, which resulted in the lowest overall accuracy of the three parts of the system. The similarity retrieval method is relatively simple, and if the thesis work had been of larger extent a more advanced method could have been chosen. For the classifications, at least one feature extraction method provided good results for each part. Different feature extraction methods and predictors might have provided better results, but when choosing such it is rarely the case that one method always outperforms the others; instead it varies much with data sets and tasks. Therefore the biggest remark on the methods chosen concerns the data set. The data set used in this investigation is an example data set which differs in many ways from the data sets for which the system is supposed to be used. The images in the data set used are not automatically taken and are not part of the same continuously recorded set. One big difference between the data set used and a set of images that belong to a continuously recorded series is that the background is typically more predictable in the latter. For images continuously recorded during a flight, the background may roughly consist of land, water and sky from afar in all images, meaning that the context is similar for all images. For the data set used, however, the context in the images varies between indoor and outdoor scenes in different places in the world and from different views. In the content classification, since entire images are set to salient or non-salient, it is most likely harder for the predictor to create an accurate classification model of saliency for the data set used, where both objects and context vary much, compared to a data set where the context is more similar. That might explain why the category airplane shows better results in the content classification for all feature extraction methods; airplanes are typically found in more homogeneous contexts than the other categories, such as sky and airplane runways. The problem with the variety in context in the data set also affects the similarity retrieval part. If the context were similar, the variety in objects present would have the major impact on the similarity measures, which is desired. Instead, with the data set used, the context varies much and lower similarity measures are very often caused by variation in context rather than in the salient object. Since so little is known about the data sets for which the system is supposed to be used, the investigation is very general. The more that is known about a problem, the more the approach can be specialized to solve it. Better results can probably be achieved when investigating quality if it is known what quality distortion types are prevailing, since methods can then be chosen with more consideration.

5.3 Possible improvements

If more is known about the data sets for which the system is supposed to be used, many improvements are possible. For example, if it is known what kind of context is typically prevailing during a flight, that information can be used to advance the similarity retrieval part. The color coherence vector can be weighted so that colors typically appearing in the context of a planned flight get a lower weight, giving a similarity measure which is less dependent on the context, as sketched below. The images might be processed by an automatic target recognition system during flights when collecting data, but such a system is not available for this study. Taking advantage of the results from such a system, the position of objects can be found in images.


That way, instead of investigating entire images, only the parts where a potential salient object is found can be investigated.
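The weighting of the color coherence vectors suggested above could, for example, take the following form. This is a sketch under the assumption of an L1-based similarity between two color coherence vectors (alpha1, beta1) and (alpha2, beta2); the actual similarity measure and suitable weights would have to be adapted to the implementation and to the expected flight context.

% Minimal sketch: down-weighting color bins dominated by the expected context.
w = ones(1, nColors);
w(contextBins) = 0.2;      % contextBins: indices of color bins dominated by expected context, e.g. sky or water

d = sum(w .* (abs(alpha1 - alpha2) + abs(beta1 - beta2)));   % weighted L1 distance between the CCVs
n = sum(w .* (alpha1 + alpha2 + beta1 + beta2));             % normalisation
similarity = 1 - d / n;    % 1 when the weighted vectors are identical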

The feature extraction method that provides the best results in the content classification is the one using features extracted from a pre-trained convolutional neural network. The network is not trained for the task on which it is evaluated, but still outperforms the other methods used. That suggests that using a convolutional neural network trained on the intended task might provide even better results in the content classification.

6 Conclusions

Using features from the DCT domain together with the SVM classifier provided very good results in differentiating between good and bad quality in images. Using features extracted from a CNN together with the SVM classifier provided good results in differentiating between salient and non-salient content in images. The classifications together with the similarity retrieval part form the image selection system. The entire system provided acceptable results, but leaves room for improvement.

The results are acceptable for a selection system containing many steps, but for the intended purpose they are not good enough. Discarding an important image due to a false classification can result in fatal consequences if an important target is captured but dismissed. Even when changing the threshold in the classifications to prioritize avoiding the error of discarding too many images, higher accuracy is desired. Since the result varies between the sets having different salient objects, it is likely that it varies between data sets as well. The data set used differs much from the data sets for which the system is intended. A data set containing automatically taken flight data does not to the same extent have the problem of varying context, which causes difficulties for some parts of the system. Therefore, using the system on the intended data set might lead to substantially better results. For better results, more information than the raw pixel values should be used, for example what context is prevailing during a recording and where in the image a potential salient object is.


