Scene Understanding using DSO Cognitive Architecture

Gee Wah Ng, Xuhong Xiao, Rui Zhong Chan, Yuan Sin Tan
DSO National Laboratories
20 Science Park Drive, Singapore 118230
Email: {ngeewah, xxuhong, cruizhon, tyuansin}@dso.org.sg

Abstract—This paper presents how a novel cognitive architecture, the DSO Cognitive Architecture (DSO-CA), has been applied to scene understanding. DSO-CA is able to bring to bear different types of knowledge to solve problems. It has been applied to object recognition in images. The cognitive architecture is able to fuse bottom-up perceptual information, top-down contextual knowledge and visual feedback in a way similar to how humans utilize different knowledge to recognize objects in images or video scenes. This enables the cognitive system to achieve scene understanding.

Keywords—Cognitive Architecture; Scene Understanding; Information Flow; Object Classification; Top-down; Bottom-up; Contextual Information; Blending Feedback

I. INTRODUCTION

This paper presents our ongoing effort in employing a novel cognitive architecture, DSO-CA, to build a scene understanding system. The system achieves scene understanding through the fusion of top-down knowledge and bottom-up information. This is done through a two-level process: a region classification process followed by a finer object classification process.

A cognitive architecture specifies a computational infrastructure that defines the various regions and functions working as a whole to produce human-like intelligence [1]. This architecture also defines the main connectivity and information flow between the various regions and functions. These functions and the connectivity between them in turn facilitate and provide implementation specifications for a variety of algorithms. There exist a number of excellent cognitive architectures, but many have overlooked the importance of biological validity. We believe that neuroscience should be the starting point, with tools developed from there.

Drawing inspiration from various fields, for example neuroscience, psychology and biology, a top-level cognitive architecture was developed. Various key parts of the human brain and their functions are identified and included in the design. Some of the desired behaviors are set as design principles. The cognitive architecture also models information processing in the human brain. The human brain is able to process information in parallel and to bring to bear different types of knowledge, distributed throughout the brain, to solve a problem.

The top-level cognitive architecture design is shown in Fig. 1. Five core regions in the human brain, namely the Frontal Cortex, Perception, Limbic System, Association Cortex and Motor Cortex, are identified. Each of these five regions represents a class of functions in the brain. The corresponding classes of functions are Executive Functions, Perception, Affective Functions, Integrative Functions and Motor Control, respectively. In the next section, a brief description of a prototype cognitive system developed based on this design is given. This is followed by a discussion of how scene understanding, and image classification in particular, is carried out in the cognitive system.

Figure 1. Top-level Cognitive Architecture Design

II. PROTOTYPE COGNITIVE SYSTEM

A prototype cognitive system (Fig. 2) is developed based on the top-level design. Some functions from each of the five core regions are developed as modules, which form the basic building blocks.

Figure 2. Prototype Cognitive System


A module is the smallest functional unit of the computational architecture and provides a certain capability. A module is fully encapsulated, with its own knowledge base (distributed long-term memory), internal representation schemes and inference methods. Thus a module can be treated as a black box: other modules in the system do not need to know how it works internally. Each module communicates with other modules either directly or through the Relay (Thalamus) module. Since different modules may have different internal representation schemes, a potential communication problem among modules may arise in the computational architecture. This problem is solved by adopting a common representation scheme for all module outputs.

Modules that perform similar functions are grouped together into classes. For instance, the Perception class comprises all modules that perform perceptual functions. Similar modules are grouped into classes because different algorithms may be used to find solutions for different problem spaces. With the concept of classes, each module in the same class can implement just one specific algorithm, which makes the code of each module smaller and easier to maintain. The modules in a class can have complementary, competitive or cooperative relationships. A meta-module for each class may be required to manage the outputs from the different modules within the class.

The prototype system implements each module as an individual executable program. This is in concordance with the parallelism principle of the cognitive architecture.
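To make the module/Relay pattern concrete, here is a minimal in-process sketch in Python. The real system runs each module as a separate executable, so the classes, method names and pathway format below are purely illustrative assumptions, not the system's actual interfaces.

class Module:
    """Encapsulated unit: own knowledge base, representation and inference."""
    def __init__(self, name, relay):
        self.name = name
        self.relay = relay
        relay.register(self)

    def receive(self, message):
        raise NotImplementedError  # each module hides its internals

    def send(self, message):
        # Outputs use a common representation so other modules can consume them.
        self.relay.route(self.name, message)

class Relay:
    """Thalamus-like hub: distributes outputs along configured pathways."""
    def __init__(self, pathways):
        self.pathways = pathways          # e.g. {"Perception": ["Reasoner"]}
        self.modules = {}

    def register(self, module):
        self.modules[module.name] = module

    def route(self, sender, message):
        # Forward the message to every module on the sender's pathway.
        for target in self.pathways.get(sender, []):
            self.modules[target].receive(message)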

A. Description

Perception class: Modules belonging to the Perception class act as receivers to the external world. They take in raw inputs from the external world and process them into useful information. The processed information is then sent to the Relay module for distribution to the rest of the modules in the agent.

Motor class: Modules in the Motor class are used to alter both the external environment and the internal state of the agent. These modules receive instructions from modules such as Selector and apply the necessary actions to the external environment or internal state of the agent.

Association class: Association modules retrieve a list of plausible actions or states when presented with a situation picture. This list of actions or states is associated with the current decision or situation picture. The list is then sent back to the Relay module for further processing by other modules. The current implementation contains a module which builds upon a rule-based engine.

Reasoner class: Reasoner modules analyze situations and propose actions. They are responsible for higher-level reasoning. The current implementation contains a Dynamic Reasoner module, which uses D’Brain [2] for its internal algorithm, and an Associative Reasoner, which is able to perform reasoning on any generic semantic network.

Selector class: The role of Selector modules is to select an action or a decision from a list of proposed actions or decisions so as to reach the current goals or sub-goals. The current implementation contains a FALCON module [3], which enables reinforcement learning in the cognitive system. This enables the Selector module to make better selections over time.

Relay module: The Relay module distributes information to the relevant modules and maintains the current situation picture, in the form of working memory, for all the modules in the system. It functions like the Thalamus in the Limbic System. The current Relay module can combine information from different modules and distribute it to the relevant modules, as well as route information along parallel pathways defined by a user-specified pathway configuration. The cognitive architecture also has the capability to learn pathways on its own.

Goals Monitoring module: The purpose of the Goals Monitoring module is to produce appropriate sub-goals from the top-level goals and then monitor the current situation to check the status of these goals. The status of the goals can be used to update the other modules, which may affect their processing of information.

Episodic module: This module consolidates and forms new memory in the form of episodic memory.

III. SCENE UNDERSTANDING IN DSO-CA

Fig. 3 presents the building blocks of scene understanding, with the visual information flow in the prototype cognitive system indicated. For example, top-down facilitation takes place in the Reasoner module. Total scene understanding is achieved via two processes:

• An interactive process between initial classification and top-down facilitation, which emulates the typical bottom-up and top-down interaction [4, 5, 6] and the visual attention processes [7, 8]. Initial classification refers to the early, coarse-level classification (e.g., road, man-made objects) of local scenes based on low-level features. Top-down facilitation is responsible not only for making use of contextual knowledge to resolve uncertainties in initial classification through a contextual analysis process, but also for suggesting regions of interest worthy of further attention, as well as initial guesses of possible categories for fine-grained object classification.

• A fine-grained object classification process with a Blending Feedback loop. Fine-grained object classification identifies more specific categories of interesting objects (e.g., pedestrians, buses, cars). The Blending Feedback loop is used to resolve uncertainties such as those caused by adverse conditions like blurring, thereby improving classification accuracy. It boosts classification confidence by aptly blending the input information with "templates" of candidate categories stored as prior knowledge, much as cortical networks in the human brain 'fill in' incoming visual information which is often incomplete or ambiguous [9].

The two processes are detailed in the remainder of this paper.


Figure 3. Building blocks of scene understanding, showing the information flow between modules in the prototype cognitive system

A. Context-based and Object-based Facilitation

Early scene understanding work focused on specific object classification, assuming the corresponding chips have been segmented from an image. Most algorithms consider each potential target independently and are based solely on measurements of that target. Due to the nature of the images, these classification methods generally have limited success.

In recent years, more work has been done on total scene understanding [10], which classifies all parts of an image. On one hand, this is motivated by practical applications requiring a full understanding of the situation, instead of individual object detection/recognition. On the other hand, it is driven by the understanding that humans do not rely on an image chip alone to conduct classification; in reality, humans also consider other sources of information. For example, a human will look at the whole picture before the image chip is segmented, to gather extra information that may help infer the class of the segmented chip.

More recent evidence reveals that visual object recognition can be triggered not only by a context-based cortical mechanism but also by an object-based cortical mechanism [11]. Hence, enhancing classification performance must consider both context-based and object-based facilitation.

Motivated by the above considerations, as shown in Fig. 3, during top-down facilitation, contextual information and contextual knowledge are used to resolve classification uncertainties and to provide early guesses about the potential object categories. This facilitates the fine-grained object classification process, which is further supported by an object-based Blending Feedback loop. These facilitations distinguish the system from traditional object recognition systems, such as those conducting intensive sliding-window searches for interesting objects [12].

B. Information Flow in the DSO-CA for Scene Understanding

The cognitive system has the ability to process different kinds of knowledge, similar to how humans use different types of knowledge to solve problems. For example, both contextual information and contextual knowledge need to be considered in the classification process to enhance classification accuracy. The cognitive system, with its ability to fuse together different types of knowledge, can be used to achieve this.

We, as humans, typically use top-down and bottom-up information to solve problems in a kind of signal-symbol fusion. Contextual knowledge captured in the form of long-term memory is a form of top-down symbolic input modeled after the context-based cortical mechanism, while the actual image provides the bottom-up signal information. Contextual knowledge can be seen as a form of prior knowledge which may be learned or gathered through experience. Another top-down information process is feedback to Perception, modeled after the object-based cortical mechanism. Previous research has shown that our visual pathways are not unidirectional [13]; in other words, there are also feedback signals to our visual cortex. The system models this by storing templates (obtained from the same training samples that the Perception module is trained on) in the Association module and retrieving associated templates to send to the Perception module as visual feedback.

The arrows in Fig. 3 show how perceptual inputs from the image are sent to different parts of the cognitive system via the Relay module. Certain contextual information may be present in the image itself, such as a particular formation of objects or the existence of other objects in the same image which can help to identify the object of interest. Together with other accompanying contextual information obtained from other sources, for example a Geographic Information System, this information can be extracted and sent together with the classification results to the other parts of the cognitive system. These form the bottom-up information.

Contextual knowledge is stored in the Executive Function. The current implementation uses D'Brain as the reasoning engine, with the contextual knowledge captured in the form of Bayesian Network fragments. The Perception output and the contextual information instantiate some of these fragments, which are pieced together to form a situation-specific Bayesian Network. In this way, the bottom-up perceptual inputs are fused with the contextual knowledge. The reasoning engine's inference results then suggest regions of interest for fine-grained object classification. The results are then sent to the Selector module, which chooses the top few classification classes based on the results and sends them to the Association module via the Relay module.

Next, the Association module retrieves the corresponding templates based on the selected classes. It then sends them, via the Relay module, to the Perception module as feedback. At the Perception module, the template is "blended" with the original target chip. The blending mechanism is modeled after the human visual recognition process, whereby perceived images are adjusted with respect to preconceived templates. The blended image then goes through classification at the Perception module. This feedback forms part of the top-down information.

IV. INITIAL CLASSIFICATION AND TOP-DOWN FACILITATION

A. Initial Classification

The initial classification sub-module consists of algorithms that implement early, coarse-level classification of local image regions. Each image is over-segmented into small coherent regions as described in [14]. The gist of each region is represented by a feature vector which includes 24 color features and 107 texture features (36 anisotropic Gauss filtering features [15], 12 Gabor filter features [16] and 59 Local Binary Pattern features [17]). Popular classifiers, such as the Support Vector Machine (SVM) and the Multi-Layer Perceptron (MLP), are included in the Perception module and operate on these features.
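As an illustration, the following minimal sketch builds a per-region feature vector and trains a coarse classifier, assuming scikit-image and scikit-learn are available. The color statistics are placeholder stand-ins for the paper's 24 color features; only the 59-bin uniform LBP histogram follows reference [17] directly, and all function names are ours.

import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def region_features(rgb_patch):
    """Build a feature vector for one over-segmented region."""
    # Simple per-channel statistics as a stand-in for the 24 color features.
    color = np.concatenate([rgb_patch.mean(axis=(0, 1)),
                            rgb_patch.std(axis=(0, 1))])
    # 59-bin uniform LBP histogram (LBP_{8,1}^{u2}), per reference [17].
    gray = rgb_patch.mean(axis=2)
    lbp = local_binary_pattern(gray, P=8, R=1, method="nri_uniform")
    hist, _ = np.histogram(lbp, bins=59, range=(0, 59), density=True)
    return np.concatenate([color, hist])

def train_initial_classifier(patches, labels):
    """patches: region image patches; labels: coarse categories (road, sky, ...)."""
    X = np.stack([region_features(p) for p in patches])
    return SVC(probability=True).fit(X, labels)  # probabilities used downstream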

Once the local regions are classified, they are grouped into bigger components: regions that are connected and belong to the same category form a component. These components constitute the initial classification of the image scene. For example, the image in Fig. 4 is classified into multiple components, with their categories represented by different color codes. Among them, interesting objects generally refer to obstacles that the vehicle needs to avoid or objects worth further observation (e.g., humans, vehicles). A minimal grouping sketch is shown below.
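This sketch assumes the region labels have been rasterized into a per-pixel category map; the paper groups over-segmented regions, so pixels are used here purely for brevity.

import numpy as np
from scipy import ndimage

def group_components(category_map):
    """Return a list of (category, mask) components from a label map."""
    components = []
    for cat in np.unique(category_map):
        # Label each 4-connected blob of this category separately.
        blobs, n = ndimage.label(category_map == cat)
        for i in range(1, n + 1):
            components.append((cat, blobs == i))
    return components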

Figure 4. Example of initial classification. (a) Image; (b) initial classification result, where categories are indicated by individual colors, with some labelled.

B. Top-down Facilitation

Top-down facilitation involves two tasks. First, just like human beings, who resolve uncertainties in their local observations based on contextual information, it resolves uncertainties of the initial classification through a contextual analysis process. Second, based on the classification of local regions and task-relevant knowledge, it decides which image area warrants further attention, either because further analysis is required due to significant ambiguity or because the task requires more specific classification of certain objects or events.

D'Brain is the main knowledge representation and reasoning engine in the Reasoner module of the DSO-CA. It represents knowledge as Bayesian Network fragments. The advantage of such a representation is that knowledge from different sources can easily be represented by combining various fragments. For example, knowledge such as "targets can be camouflaged" or "targets will appear in groups in a battlefield" can be directly input by human experts and modeled by individual fragments, while other fragments can be learned by transforming contextual information extracted from labeled training data into image-relevant contextual knowledge, such as "sky cannot be surrounded by road area".

Image-based contextual information is extracted from the initially classified components. To achieve this, the existence of objects in a component's neighborhood is checked in the top, bottom and horizontal directions around the component's bounding box. For each category, two possibilities exist in the top and bottom directions: the category either appears or does not appear along that direction. Along the horizontal direction, there are three possibilities: the category does not appear on either side, the category appears on the left or the right, or the category appears on both sides. These possibilities are represented probabilistically as contextual information for the component under consideration; a minimal extraction sketch is given below.
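The sketch below follows the directional checks described above, under the simplifying assumption of a per-pixel category map; the probabilistic encoding over training samples is omitted, and all names are ours.

import numpy as np

def neighbor_context(category_map, mask, cat):
    """Does category `cat` appear above, below, left or right of `mask`?"""
    ys, xs = np.nonzero(mask)
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    above = np.any(category_map[:top, left:right + 1] == cat)
    below = np.any(category_map[bottom + 1:, left:right + 1] == cat)
    on_left = np.any(category_map[top:bottom + 1, :left] == cat)
    on_right = np.any(category_map[top:bottom + 1, right + 1:] == cat)
    # Horizontal direction takes three values: none / one side / both sides.
    horizontal = {(False, False): "none",
                  (True, True): "both"}.get((on_left, on_right), "one_side")
    return {"top": above, "bottom": below, "horizontal": horizontal}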

Using the contextual information extracted from the training samples, the image-relevant contextual knowledge model shown in Fig. 5 is learned. The node "class" takes all the possible categories as values. The probabilities of this node are updated based on the evidence from the other nodes, each of which corresponds to a different piece of contextual information. For example, the node "topSky" corresponds to whether the top neighbor of the component is "sky". This learned Bayesian Network fragment forms part of the long-term contextual knowledge in the architecture.
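For illustration, here is a sketch of how evidence nodes such as "topSky" could update the "class" node, assuming a naive-Bayes-style factorization; the actual D'Brain fragment structure and inference are richer than this.

def update_class(prior, likelihoods, evidence):
    """prior: {class: P(class)};
    likelihoods: {node: {class: P(node=True | class)}};
    evidence: {node: bool}, e.g. {"topSky": True}."""
    post = dict(prior)
    for node, observed in evidence.items():
        for cls in post:
            p = likelihoods[node][cls]
            post[cls] *= p if observed else (1.0 - p)
    z = sum(post.values())
    return {cls: v / z for cls, v in post.items()}  # renormalized posterior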


Figure 5. Learned Bayesian Network fragment incorporating image-based contextual information

During the contextual analysis phase, contextual information is extracted for those components with low initial classification confidence. Such contextual information is passed to the D'Brain reasoner as evidence. The confidence inferred by the D'Brain reasoner is fused with the initial classification confidence to determine the final category of the corresponding component, as sketched below.
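The paper does not spell out the fusion rule at this level; the sketch below assumes a simple per-category product of confidences with renormalization.

import numpy as np

def fuse_confidences(initial, contextual):
    """initial, contextual: per-category confidence arrays over the same classes."""
    fused = initial * contextual
    return fused / fused.sum()  # final category = fused.argmax()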

After contextual analysis, we will, in most cases, acquire a more accurate coarse-level classification of the scene. For example, the classification after contextual analysis of the image in Fig. 4(a) is shown in Fig. 6(b), in which the wrong initial classifications in the circled areas of Fig. 6(a) are corrected. This rough classification segments the scene into various regions of different categories. The task requirement can then be used to focus attention on regions of interest. For example, for surveillance purposes, we may want to know the specific types of the interesting objects (e.g., are they humans or vehicles?). This leads to fine-grained object classification.

Figure 6. Effects of contextual analysis. (a) Initial classification result as in Fig. 4(b), with some errors circled; (b) result after contextual analysis, where the errors in (a) are corrected.

V. BLENDING FEEDBACK

In solving image classification problems, the negative effects of random noise, distortions and adverse imaging conditions such as rain, slight fog, clouds or poor imaging resolution are often present. However, human beings are able to deal robustly with such negative effects by modifying either the input with respect to prior templates or vice versa. In particular, there is neuroscience evidence of an object-based cortical mechanism in our brain during object recognition, which can be modeled as projecting an initial input image (a blurred image, for example) from the early visual areas, pre-sensitization of possible initial guesses of the object, and feedback of these guesses for further visual processing. It is this robust human visual recognition system that inspires us to come up with the Blending Feedback (BF) algorithm as a computational model to better solve image classification problems.

BF integrates a number of technologies to enhance image classification. The classical approach is to find a suitable classifier such as an SVM or a neural network. In BF, such a classifier is taken as a base classifier and then improved upon by integrating the following key ideas and technologies. First, traditional object classification is sequential and purely feedforward; BF improves classification by coupling the feedforward pass with a feedback approach. Second, a stability-based feature blending process leads to the dynamic selection of different object parts so as to aptly adjust and match a new test input with respect to a template.

The main modules in the Blending Feedback algorithm are Classification, Class Selection, Template Selection, Feedback and Blending (Fig. 7). The algorithm starts off with the Classification module using a suitable classifier such as an SVM. The Class Selection module then selects the top k (k ≥ 1) classes from the classifier outputs; by default, k is set to 3. The Template Selection module then chooses the best matched template among these top k classes. This is followed by feedback and blending of the best template into the classifier input, and classification is then repeated on the blended image. A minimal sketch of the loop follows.
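This sketch assumes a probabilistic base classifier (e.g., scikit-learn's SVC with probability=True), integer class labels 0..n-1, and per-class template banks of (feature vector, per-feature stability) pairs produced offline by clustering; all helper names are ours, not the authors' code.

import numpy as np

def blending_feedback(x, clf, templates, k=3, c=0.3):
    """x: test feature vector; templates[cls] -> [(template_vector, stability)]."""
    probs = clf.predict_proba([x])[0]
    top_k = np.argsort(probs)[::-1][:k]                     # Class Selection
    candidates = [tpl for cls in top_k for tpl in templates[cls]]
    # Template Selection: nearest template among the top-k classes.
    f_t, p = min(candidates, key=lambda tp: np.linalg.norm(tp[0] - x))
    w = c * (1.0 - p)                                       # per-feature weights, eq. (3) in Section V
    blended = w * f_t + (1.0 - w) * x                       # Blending
    return clf.predict([blended])[0]                        # Feedback: re-classify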

Class Selection: Its importance draws on the observation that human beings do not compare an incoming test image with all stored templates. Class Selection creates a subset of the most probable classes for further distinction and discrimination.

Template Selection: Prior to BF, several templates representing each class are obtained by applying a clustering algorithm on training images. The Template Selection module plays an important role in choosing the best representative template among a subset of classes for blending. Additionally, the Template Selection and the Classifier compete and complement each other in the Blending Feedback algorithm as two vital decision makers working hand in hand.

Figure 7. Schematic Diagram of the Blending Feedback Algorithm


Blending: Stability-based feature blending is modeled after how humans recognize objects based on definitive features, dynamically selecting different object parts to aptly adjust and match a new test input with respect to a template. For example, we can recognize an image as a face if we are able to identify two ears, two eyes, one nose and one mouth as features. In our minds, we have already stored templates based on definitive features, representing the different objects we currently know at various orientations and under various circumstances. We in fact try to match these templates to test images during visual recognition, using stability and separability measures obtainable from training data.

Templates are represented and stored as feature vectors, and the stability of each feature is also recorded. In the simplest case, the pixels are taken as the features. Let

F_Template = {f_T,1, ..., f_T,N}  (1)

be the template feature vector of a category, where f_T,i is the value of the i-th feature in the template. The stability of each feature in the template is measured by its occurrence frequency,

p_i = O_i / M  (2)

where M is the total number of training samples in the category and O_i is the value of the histogram bin that f_T,i falls in over the training samples of that category.

Thus, for a given input represented by the feature vector F_TestImage = {f_I,1, ..., f_I,N} and a best matched template candidate F_Template, the features of the blended image are obtained as a weighted sum of the corresponding features of the template and of the test image. The i-th blended feature is defined as

f_B,i = w_i · f_T,i + (1 - w_i) · f_I,i,  where w_i = c · (1 - p_i).  (3)

This weighted sum depends on two parameters: the stability value p_i of the template feature, determined by the occurrence frequency of that feature in the training samples, and the blending effect parameter c (0 < c < 1), a coefficient that controls the extent of feature adjustment.
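A direct sketch of eqs. (1)-(3), assuming integer-valued pixel features so that occurrence counts can be read off by exact matching (a stand-in for the per-feature histogram bins); variable names mirror the equations.

import numpy as np

def feature_stability(train_vectors, template):
    """p_i = O_i / M: frequency of the template value among the M training samples."""
    M = len(train_vectors)
    train = np.asarray(train_vectors)
    # O_i: how often feature i of a training sample falls in the same bin as
    # f_T,i (exact match assumed here for integer-valued features).
    O = (train == template).sum(axis=0)
    return O / M

def blend(template, test_image, p, c=0.3):
    """f_B,i = w_i * f_T,i + (1 - w_i) * f_I,i, with w_i = c * (1 - p_i)."""
    w = c * (1.0 - p)
    return w * template + (1.0 - w) * test_image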

The blending effect parameter c is computed based on two factors. The first factor is the reliability of the templates. The second factor is the relevancy of the templates with respect to the test inputs, i.e. the closeness of match between the test and train distributions.

To handle the first factor of template reliability, we use a function of scatter matrices: the between-clusters scatter Sb, which describes the separability between clusters, and the within-cluster scatter Sw, which describes the sparseness of the data points within a cluster. The scatter matrices are generated automatically from the training set.

One common reason for poor classification accuracy is a poor choice of feature space, in which two classes may be located close together. Such closeness can be described by separability measures or scatter matrices. Ideally, we want classes to be as far apart as possible in the feature space while keeping each class as compact as possible.

During the training phase, there will be a few clusters per class after the training images are passed through a clustering algorithm. Therefore, during the Template Matching process, when a test image is passed through the BF mechanism, an additional step is performed: for each class, the cluster whose template best matches the test input is chosen among all clusters of that class. We thus end up with n clusters for n classes. Using these n clusters, the scatter matrices Sb and Sw are computed; ideally, we want Sw to be as small and Sb as large as possible. With the scatter matrices computed, the blending effect parameter c can also be computed, as sketched below.
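A sketch of the scatter computation on the n chosen clusters. The final mapping from (Sw, Sb) to c is not specified in the paper, so a bounded separability ratio is assumed here purely for illustration.

import numpy as np

def scatter_matrices(clusters):
    """clusters: list of (n_samples_k, d) arrays, one chosen cluster per class."""
    overall_mean = np.concatenate(clusters).mean(axis=0)
    d = overall_mean.size
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for X in clusters:
        mu = X.mean(axis=0)
        diff = X - mu
        Sw += diff.T @ diff                   # spread within the cluster
        gap = (mu - overall_mean)[:, None]
        Sb += len(X) * (gap @ gap.T)          # spread between clusters
    return Sw, Sb

def blending_parameter(Sw, Sb):
    # Assumed form: more separable, more compact clusters -> larger c.
    ratio = np.trace(Sb) / (np.trace(Sb) + np.trace(Sw))
    return float(np.clip(ratio, 0.0, 1.0))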

For the second factor of template relevancy, we make the assumption that the distribution of subsequent test data comes from, or is similar to, that of the training data. Otherwise, some mapping function between the different distributions should be made available and multiplied into the parameter c.

In short, such blending will reduce the contribution of unstable features and divert the focus of attention to stable features. The blended feature vector is then passed back to the object classifier to repeat the classification process.

Feedback: It is well documented that the human visual cortex consists of multiple feedforward and feedback hierarchically-layered networks, and that there exist pathways that form specific connections with other cortices [18]. Any established classifier, such as a multi-layer perceptron (MLP) or a support vector machine (SVM), can simulate the role of the multiple feedforward networks of the human visual cortex, and substantiating such a classifier with a feedback process makes the recognition system more robust, as in the human visual system.

VI. EVALUATION

A. Top-down Facilitation

We have labeled 138 images in our image database, collected with a vehicle patrolling a military site. The major objects appearing in the images are ground, high vegetation (highVeg), grass, sky, water, and obstacles (here, obstacles refer to any objects that may affect vehicle traversability or may be worth further attention). 104 images are used for training the initial classification and contextual analysis algorithms, and the remaining 34 images for evaluation. The performance is measured by the F-measure, which is defined as

F = 2 · precision · recall / (precision + recall).
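A small worked helper for this measure (our own illustration; tp, fp and fn are per-category counts of true positives, false positives and false negatives):

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. f_measure(tp=90, fp=10, fn=15): precision 0.9, recall ~0.857, F ~0.878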

As shown in Table I, the performance after contextual analysis is consistently better than that of the initial classification. Besides the labeled testing images, we have evaluated the classification performance on thousands of unlabelled images via visual analysis (Fig. 8 illustrates examples of the classification results). The classification performance in this large-scale qualitative evaluation is quite consistent with that on the labeled testing images. As the F-measures for ground and vegetation are high, the vehicle can conduct autonomous navigation based on this result. However, the F-measure for obstacles still needs improvement. The key reason for the low performance on obstacles is that there are too many varieties of obstacles and objects in the environment and, as is well known, the detection of general obstacles and objects remains a challenge in the computer vision field.

TABLE I. F-MEASURE FOR INITIAL CLASSIFICATION AND AFTER CONTEXTUAL ANALYSIS

Category    Initial classification    After contextual analysis
ground      0.946                     0.951
highVeg     0.900                     0.925
grass       0.627                     0.691
sky         0.756                     0.772
water       0.861                     0.877
obstacles   0.547                     0.592

Figure 8. Classification results. 1st row: images; 2nd row: initial classification results; 3rd row: results after contextual analysis.

B. Blending Feedback in Fine-grained Object Classification

The blending mechanism is preliminarily tested using the MSTAR [19] SAR images of armor vehicles. In this experiment, the blending parameter c is empirically set to 0.3. The accuracy for the different categories, with and without the blending mechanism, is presented in Table II. On average, the blending mechanism increases the classification accuracy by 3.8%.

TABLE II. CLASSIFICATION ACCURACY FOR MSTAR DATA WITH/WITHOUT BLENDING

Category    Without blending    After blending
SN-9563     0.846               0.912
SN-9566     0.846               0.903
SN-C21      0.857               0.933
SN-C71      0.928               0.918
SN-S7       0.863               0.884
SN-132      0.867               0.897
SN-812      0.892               0.917
Averaged    0.871               0.909

C. Validation of Blending Feedback

In the Blending Feedback mechanism, there are concerns about undesirable effects such as a wrong bias towards a particular class (since blending with a template tends to bias the result toward the class of the template) or wrong self-reinforcement of the classification during the interplay between the base classifier and the best matched template from the Template Selection module.

Ideally, when the correct template is used, the feedback should help to increase the hit rate by boosting the confidence that the object belongs to the same class as the template. However, when a template of the wrong class is used, the false-alarm rate should not increase. This helps to prevent the system from being biased toward a particular class or from self-reinforcing wrongly. Experiments using the MNIST handwritten digits dataset of digits 0 to 9 [20] validated that the current Blending Feedback mechanism conforms to this principle.

In these experiments, 60,000 training samples and 10,000 test samples are used. An SVM is used as the base classifier and k-means is used for template creation. An exhaustive 100,000 blends are made in these validation experiments (the best matched template within each of the 10 classes, for each of the 10,000 test inputs). These can be viewed as a controlled BF mechanism whereby, for each test input, the blending feedback of one template per class is fully explored; a sketch of this protocol follows.
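This sketch reuses the eq. (3) blending step; `templates` maps each class to its (vector, stability) bank as in the earlier sketches, and all names are ours.

import numpy as np

def validate_bf(test_set, clf, templates, c=0.3):
    """test_set: iterable of (feature_vector, true_label) pairs."""
    outcomes = []
    for x, true_label in test_set:
        base_pred = clf.predict([x])[0]
        for cls, bank in templates.items():
            # Best matched template within this class.
            f_t, p = min(bank, key=lambda tp: np.linalg.norm(tp[0] - x))
            w = c * (1.0 - p)                     # eq. (3) weights
            blended = w * f_t + (1.0 - w) * x
            outcomes.append((true_label, base_pred, cls,
                             clf.predict([blended])[0]))
    return outcomes  # tally hit / false-alarm changes as in Table III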

Table III shows the validation results. Blending a correct template with a correctly classified test input increases the confidence of the correct class. In cases where blending is done with a wrong template on an already correctly classified test input, no misclassification results 99.3% of the time. On the other hand, in the more difficult cases where blending is done with a correct template on an already wrongly classified test input, the misclassification is corrected 22.3% of the time.

TABLE III. BF VALIDATION RESULTS


VII. CONCLUSION AND DISCUSSION

A cognitive architecture modeled after information processing in the human brain has been presented. It identifies core regions of the human brain and the functions that exist in each region. Key design principles inspired by the human brain are discussed and used in the cognitive architecture.

A prototype cognitive system based on the cognitive architecture has been developed and described. One key feature of the cognitive system is its ability to bring to bear different types of knowledge to solve problems. This enables the cognitive architecture to exploit bottom-up perceptual information, top-down contextual knowledge and visual feedback in a way similar to how humans utilize different knowledge to recognize objects in images or video scenes.

The building blocks of scene understanding are detailed by the information flow in the prototype cognitive system. It emulates the human scene understanding process in two ways: (1) interaction between bottom-up initial classification and top-down facilitation based on contextual information and contextual knowledge; (2) a Blending Feedback loop that "fills in" uncertain information based on prior templates, so as to enhance the fine-grained object classification capability in adverse situations.

Contextual knowledge for top-down facilitation is represented as Bayesian Network fragments and inferred via the D'Brain reasoner in the cognitive system. Tests on hundreds of images demonstrated the effectiveness of top-down facilitation in resolving uncertainties in the initial classification of scenes.

Experiments have also positively validated the effects of BF on image classification. The mechanism has been tested with good results on handwritten digit recognition from the MNIST database. In these experiments, Blending Feedback has been shown to achieve improved classification accuracy and better separability between classes, which translates to greater confidence in the declared class.

The Blending Feedback mechanism can potentially be used for challenging classification problems such as blurred and/or partially occluded images. One practical use of this mechanism is target classification from aerial images where the targets are partially occluded by tree leaves or clouds. Another is classifying far-away targets where, even on zooming in, the camera or image capturing device can only capture low-spatial-frequency or blurred images.

To further improve the performance, the building blocks in the scene understanding process can be further integrated, such as by using objects previously classified by the fine-grained object classification block to aid in the subsequent classification of other objects, as another form of top-down facilitation. Further evaluation, and tighter integration of the fine-grained object classification module with contextual knowledge and incoming contextual information, can also improve its performance.

Current work covers two areas: visual attention and activity recognition. This will bring the system another step closer to attaining total scene understanding.

REFERENCES

[1] A. Newell, Unified Theories of Cognition. Cambridge: Harvard University Press, 1990.
[2] G. W. Ng, K. H. Ng, K. H. Tan, and C. H. K. Goh, "The ultimate challenge of commander's decision aids: The cognition based dynamic reasoning machine", in Proceedings of the 25th Army Science Conference, Paper BO-05, 2006.
[3] A. H. Tan, G. A. Carpenter, and S. Grossberg, "Intelligence through interaction: towards a unified theory for learning", in Proceedings of ISNN, LNCS 4491, pp. 1098-1107, 2007.
[4] K. Friston, "A theory of cortical responses", Philosophical Transactions of the Royal Society London B: Biological Sciences, Vol. 360, pp. 815-836, 2005.
[5] S. Hochstein and M. Ahissar, "View from the top: hierarchies and reverse hierarchies in the visual system", Neuron, Vol. 36, pp. 791-804, Dec 2002.
[6] M. Bar, "A cortical mechanism for triggering top-down facilitation in visual object recognition", Journal of Cognitive Neuroscience, Vol. 15, No. 4, pp. 600-609, 2003.
[7] Y. Sun and R. Fisher, "Object-based visual attention for computer vision", Artificial Intelligence, Vol. 146, Issue 1, pp. 77-123, May 2003.
[8] S. Chikkerur, T. Serre, C. Tan and T. Poggio, "What and where: A Bayesian inference theory of attention", Vision Research, 2010.
[9] F. Crick and C. Koch, "A framework for consciousness", Nature Neuroscience, Vol. 6, No. 2, pp. 119-126, Feb 2003.
[10] L. J. Li, R. Socher and F. F. Li, "Towards total scene understanding: classification, annotation and segmentation in an automatic framework", in Proceedings of CVPR, 2009.
[11] Martinez-Conde, Macknik, Martinez, Alonso and Tse (eds.), Progress in Brain Research, Vol. 155, pp. 3-21, 2006.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", in Proceedings of CVPR, pp. 886-893, 2005.
[13] E. M. Callaway, "Feedforward, feedback and inhibitory connections in primate visual cortex", Neural Networks, Vol. 17, No. 5-6, pp. 625-632, 2004.
[14] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation", IJCV, Vol. 59, No. 2, 2004.
[15] http://www.robots.ox.ac.uk/~vgg/research/textclass/filters.html
[16] http://www.mit.edu/~jmutch/fhlib
[17] T. Ojala, M. Pietikainen and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", IEEE Trans. on PAMI, Vol. 24, No. 7, July 2002.
[18] G. W. Ng, Brain-Mind Machinery. World Scientific Publishing, 2009.
[19] http://cis.jhu.edu/data.sets/MSTAR/
[20] Y. LeCun and C. Cortes, MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/, last retrieved 6 Jan 2010.