Enhancing multiple expert decision combination strategies through exploitation of a priori information sources

A.F.R. Rahman and M.C. Fairhurst

Abstract: In recent years the concept of combining multiple experts in a unified framework to generate a combined decision based on individual decisions delivered by the co-operating experts has been exploited in solving the problem of handwritten and machine printed character recognition. The level of performance achieved in terms of the absolute recognition performance and increased confidences associated with these decisions is very encouraging. However, the underlying philosophy behind this success is still not completely understood. The authors analyse the problem of decision combination of multiple experts from a completely different perspective. It is demonstrated that the success or failure of the decision combination strategy largely depends on the extent to which the various possible sources of information are exploited in designing the decision combination framework. Seven different multiple expert decision combination strategies are evaluated in terms of this information management issue. It is demonstrated that the capability of a multiple expert decision combination approach for exploiting diverse information extracted from the various sources can be treated as a yardstick in estimating the level of performance that is achievable from these combined configurations.

1 Introduction

Optical character recognition has seen great advances in the last two decades. This is a multi-disciplinary branch of pattern recognition which has extensive applications in numerous fields including robotic vision, artificial intelligence, document processing, office automation, human-computer interfaces, man-machine systems, data acquisition, storage and retrieval etc. In recent years, this application area has been extended to forensic science, including identification of individuals using measures depending on biometrics,

© IEE, 1999. IEE Proceedings online no. 19990015. DOI: 10.1049/ip-vis:19990015. Paper first received 1st April and in revised form 23rd October 1998. The authors are with the Electronic Engineering Laboratory, University of Kent, Canterbury, Kent, CT2 7NT, UK

security and other applications. This widespread proliferation of optical recognition applications has been responsible for renewed research in this field. Apart from the extensive theoretical approaches adopted, great emphasis has been laid on practical implementations of these techniques, using appropriate software and hardware platforms. Many commercial systems have also been marketed offering an array of efficient optical recognition platforms for varied applications or as an auxiliary system supporting other major application products. From the considerations of the demands on the performance of these optical character recognition systems in terms of speed and recognition efficiency, it is now established that it is often prudent to design a system based on multiple sources of information and multiple recognition schemes rather than depending on one single classifier to solve complex recognition problems. In recent years, therefore, there has been a tendency to use multiple experts and efficient decision combination schemes for decision fusion in designing character recognition systems.

2 Motivation for the multiple expert approach

In the task of printed and handwritten character recognition, stand-alone experts are often not robust enough to tackle the problem of the huge degree of variation that is present in the samples with respect to size, shape, skew and style. The stand-alone experts are often incapable individually of computing the subtle differences between closely resembling characters, and hence the potential importance of the use of multiple experts in character recognition problems becomes apparent. The employment of multiple experts allows simultaneous use of different feature descriptors and classification procedures. As more and more relevant features are taken into account, the overall characteristics of the target character symbol become clearer. The approach can therefore exploit the strengths of individual experts and can often avoid their weaknesses, making the overall final decision more robust. Decision combination in the final stages can make use of information redundancy which is often present and take appropriate self-correcting action. Moreover, it allows the exploitation of complementary information extracted from different representations of the same character such as raw, skeletonised or parameterised versions. These observations serve as the basic motivation behind moving from an approach based on individual single experts to a framework which is based on a multiple expert platform.

IEE Proc.-Vis. Image Signal Process., Vol. 146, No. 1, February 1999

3 Approaches to multiple expert implementation

Despite all the potential advantages, the implementation of multiple expert configurations is not easy. The principal area of complexity lies in the decision combination strategy, where decisions made by the different classifiers have to be combined in such a way as to make the final decision optimal and to maximise the recognition performance. Over the years, different decision combination strategies have been implemented in the framework of multiple expert configurations. Many of these strategies are based on mathematical reasoning or statistical modelling. On the other hand, there are methods which depend on a priori knowledge of the target symbol set and which try to adapt the recognition process to maximise the use of the information gathered from the feature space, minimising the interdependence of the decision making process in the cases of different classes. In general, approaches to the combination of decisions by multiple experts in the framework of the overall configuration can be summarised as follows.

Analytical methods: type-I approach Development, formalisation and implementation of formal methods to combine multiple experts.

Pseudo-analytical methods: type-II approach Development and implementation of quasiformal methods to combine multiple experts.

Empirical methods: type-III approach Development and implementation of specialised and task-oriented tailored methods.

Neural network-based methods: type-IV approach Development of formal or informal methods for combining multiple expert decisions by employing neural network techniques.

In general, some of the methods of type-II and type-III can be expressed as belonging to one or other of the combination methods formulated by type-I combination strategies. However, not all these methods can be classified directly as belonging to one of the formal methods, and sometimes some of these specialised methods can be expressed as a combination of other methods. Some others can be expressed as partly belonging to a group of strategies.

4 Success of multiple expert approaches: an analysis

It has been demonstrated by many researchers that multiple expert combination approaches are very successful in solving complex problems (e.g. handwritten character recognition). It is very interesting to analyse why combined decisions become superior to decisions delivered by individual experts. It is noted that all the decision combination frameworks use information about the dataset and the performances of individual experts on these databases, and this information is incorporated either implicitly or explicitly in the decision combination algorithms. Although all the decision combination approaches use this information to some extent, the success of a particular approach depends on the success with which this information can be incorporated into the decision making process. It is therefore very important to analyse the types of information that can be extracted by a decision combination approach and their relationship in the context of the decision making process itself.

Depending on its source, the extracted information can be categorised into two principal classes: (i) first order information, and (ii) second order information.

The following Sections elaborate on these two different sources of information and explain how this information can be extracted.

4.1 First order information

If n classifiers (experts), working on the same problem, deliver a set of classification responses, the decision combination process has to combine the decisions taken by all these different classifiers in such a way that the final overall decision improves the decisions taken by any of the individual experts. Hence, the decision fusion process has to take into account the individual strengths and weaknesses of the different co-operating classifiers to deliver a more robust final decision. Based on the types of classification responses the individual classifiers deliver, different sources of information can be identified. Broadly, three distinctly different sources of information can be identified to exploit a multiple expert classifier combination scheme (Xu et al. [1]). These can be summarised as follows.

Absolute labels: In this case, the co-operating classifiers deliver the classification responses in the form of absolute output labels. Each of the classifiers identifies the character in question definitely as belonging to a particular class and no information other than this assigned label is available. The combination method then has to take a final decision about the true identity of the character based on this information.

Preference lists: In this case, the co-operating classifiers deliver the classification response in the form of a sorted ranking list. Each of the classifiers gives a preference list based on the likelihood of a particular character belonging to a particular class. This response is a special case of responses in the form of absolute labels, the output label in this response being the top choice of the preference response list. In preference lists, much more information is available to determine the final response of the combined classifier than in the case of responses having the absolute labels only.

Sample confidence index: In this case, the co-operating classifiers deliver the classification response in the form of confidence values. Each of the classifiers gives a preference list based on the likelihood of a particular character belonging to a particular class together with confidence measurement values generated in the original decision making process. It is seen that a response in the form of preference lists is a special case of a response in the form of sample confidence indices: a preference list supplies only the class labels in order of preference (likelihood), without any confidence values. So, in general, it can be said that sample confidence responses are the most generalised form of response, from which both preference lists and absolute labels can be generated. But, as far as the decision combination strategy is concerned, sample confidence responses are difficult to utilise, as the actual measurement values need to be converted to a normalised scale before any incorporation of information involving a comparison of the individual co-operating classifiers can take place.

These three sources of information are directly derived from the test samples during the recognition phase and therefore are defined as ‘first order information’.

4.2 Second order information

As discussed in the previous Section, three different levels of information can be extracted from the test samples. This information can be in the form of absolute labels, as an ordered list of preferences or can be exact confidence indices related to the decisions of the experts. It is also possible to extract additional information about the data diversity, the relative performance indices of the individual experts on specific databases and specific domain information about the characteristics of a particular target dataset. These are collectively defined as ‘second order information’. All this information is a measure of the a priori knowledge about the specified database, the performance of the chosen classification methods on that database and the relative strengths or weaknesses of these methods in dealing with a particular problem domain.

To extract and evaluate these sources of information, it is necessary that an independent evaluating dataset is utilised. So, in this case, the available data needs to be partitioned into three mutually exclusive subsets, one for training, the second for evaluating the information indices and the final subset for testing. The following is a brief description of some important relevant information sources.
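The three-way partition described above can be sketched as follows. This is only an illustration: the function name, the split fractions and the random seed are assumptions introduced here and are not taken from the paper.

```python
import random


def partition_three_way(num_samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into disjoint training, evaluation and test subsets.

    The fractions and the seed are illustrative assumptions only.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * num_samples)
    n_eval = int(fractions[1] * num_samples)
    train = indices[:n_train]                        # used to train the experts
    evaluation = indices[n_train:n_train + n_eval]   # used to derive second order information
    test = indices[n_train + n_eval:]                # used only for final testing
    return train, evaluation, test


train_idx, eval_idx, test_idx = partition_three_way(100)
```

Because each index appears in exactly one subset, second order indices estimated on the evaluation subset are not influenced by the samples used for training or for the final test.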

Overall confidence index: In general, when two or more experts are compared, they are compared based on the overall recognition rate. This recognition rate is the average of all the individual recognition rates on each individual class in the complete database. Hence, this index gives the overall preference of a particular expert over other competing experts, based on its average performance on all the available classes.

Class confidence values: The overall confidence values give an overall impression about the performance of the different experts. What this particular index fails to provide is a measure of how reliable an expert is in recognising individual classes. It is observed that some experts are more powerful in recognising some classes compared with other experts. Class confidence values provide this information on a class by class basis.

Data diversity: Certain observations can easily be made about the structures of the characters. It is found that some structures are inherently similar, but some are significantly different. As the classification algorithms incorporate some measure of structural features to build up a representative model for a particular character, it is expected that there would be a high degree of confusion in cases of structurally similar characters. The degree of similarity and dissimilarity among the classes define the overall data diversity of the target character set. Extracting quantitative information about the data diversity can be a very powerful tool in designing multiple expert decision combination approaches.

Specific consideration: It is often possible to exploit particular characteristics of a target set in designing special algorithms when attempting to combine decisions delivered by multiple experts. In some cases, special subgroups or supergroups of classes might be formed to facilitate the decision combination process. This often leads to the design of special purpose dedicated tailored algorithms targeted at particular classes only. Incorporation of this high level information is reflected in the overall decision making process.

5 Assessing information from various sources

As already seen, there are seven important performance indices that can be derived from a target database and the co-operating experts participating in the decision combination configuration. These indices are used to form an overall knowledge-base. The first three indices are determined from the test samples during the test phase; the remaining four indices are derived from an evaluating dataset, independent of the training and testing datasets. These four represent measures of the relative strengths of the co-operating experts and the data diversity of the target dataset. Together, these seven indices represent the a priori first and second order information accumulated in the knowledge-base. The following Sections elaborate how some of this information is extracted.

5.1 Assigning sample confidence values

The sample confidence values $\alpha$ to be associated with every test sample denote the confidence of the expert in question in classifying the pattern to a particular class. The confidence values have to be comparable across the array of experts if they are to be used to optimise the combined decision. Some of the methods for generating sample confidence indices can be found in Ho et al. [2]. As the magnitudes of the responses of different experts vary, different normalisation approaches are adopted. In general, every participating expert assigns a confidence value for every sample it evaluates from every class. So, the total information extracted is stored in a three-dimensional space. $\alpha_{ijk}$ therefore denotes the sample confidence value assigned by the ith expert to the kth sample coming from the jth class in the test set (Rahman and Fairhurst [3]).

5.1.1 First approach: From the training set m prototypes are created, where m is the number of classes. For each test sample, similarity responses are calculated from each of these prototypes. Assuming the responses to be $\zeta_1, \zeta_2, \ldots, \zeta_m$, these are sorted in descending order of magnitude and their maximum $\zeta_{max}$ and minimum $\zeta_{min}$ occurrences are indexed. The difference $\zeta_{max} - \zeta_{min}$ is defined as the range, which is then scaled so that all the n responses are normalised to fall within the physical range of 1 to n.

5.1.2 Second approach: In this case, instead of using the confidence values directly, their respective differences from the magnitude of the highest response are used. Again, assuming the responses are $\zeta_1, \zeta_2, \ldots, \zeta_m$ in the case of the different prototypes, the values $\zeta_{max} - \zeta_r$, where r varies from 1 to m, are calculated. Higher confidence is associated with smaller values of $\zeta_{max} - \zeta_r$. Confidence values generated from these patterns are again scaled to a physical range of 1 to n.
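The two normalisation approaches can be illustrated with a short sketch. The paper does not state the exact scaling formula, so a linear mapping of the range onto the interval 1 to n is assumed here; the function names and the example responses are hypothetical.

```python
def scale_to_range(values, low, high):
    """Linearly map values so that the minimum goes to low and the maximum to high.

    Linear scaling is an assumption; the source only says the range is 'scaled'.
    """
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # All responses are equal, so no preference can be expressed.
        return [float(low)] * len(values)
    return [low + (v - v_min) * (high - low) / span for v in values]


def confidence_first_approach(responses, n):
    """First approach: scale the raw responses to [1, n]; larger means more confident."""
    return scale_to_range(responses, 1, n)


def confidence_second_approach(responses, n):
    """Second approach: scale the differences from the highest response to [1, n].

    Smaller values mean higher confidence, as stated in the text.
    """
    top = max(responses)
    differences = [top - r for r in responses]
    return scale_to_range(differences, 1, n)


responses = [2.0, 8.0, 5.0]  # hypothetical similarity responses from three prototypes
first = confidence_first_approach(responses, n=4)    # [1.0, 4.0, 2.5]
second = confidence_second_approach(responses, n=4)  # [4.0, 1.0, 2.5]
```

Under this linear assumption the two approaches carry the same ordering information, with the direction of the scale reversed; a non-linear scaling would make them differ in more than direction.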

5.2 Building preference lists

As already seen from the previous Section, if there are m prototypes, similarity responses can be calculated from each of these prototypes. Again, assuming the responses to be $\zeta_1, \zeta_2, \ldots, \zeta_m$, they can be sorted in descending order of magnitude and normalised as pointed out in the previous Section. If the normalised responses are $q_1, q_2, \ldots, q_m$, the corresponding class labels can be expressed as $x_1, x_2, \ldots, x_m$, and this is the preference list that an expert generates when co-operating with other experts on a unified multiple expert framework. This list contains the preference of a particular expert in classifying a particular character to a particular class. In fact, the preference list is a subset of the information generated as sample confidence indices; in this case the corresponding class labels are used instead of the actual confidence values.

5.3 Assigning absolute labels

Once an expert generates the preference list, the top choice from that list can be assigned as the absolute label generated by a particular expert in classifying a particular character. For example, if the preference list is expressed as $x_1, x_2, \ldots, x_m$, then the top choice from that list, $x_{top}$, is assigned as the absolute label. This information is a subset of the information that is generated as the preference list.
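The chain from similarity responses to preference list to absolute label can be sketched as below. The function names, class labels and response values are hypothetical and serve only to illustrate the subset relationship described in the text.

```python
def preference_list(responses, class_labels):
    """Return the class labels ordered by decreasing response (most preferred first)."""
    order = sorted(range(len(responses)), key=lambda r: responses[r], reverse=True)
    return [class_labels[r] for r in order]


def absolute_label(preferences):
    """The absolute label is the top choice of the preference list."""
    return preferences[0]


class_labels = ["A", "B", "C"]   # hypothetical classes, one prototype each
responses = [2.0, 8.0, 5.0]      # hypothetical responses of one expert
prefs = preference_list(responses, class_labels)
top = absolute_label(prefs)
```

Each step discards information: the preference list drops the confidence magnitudes, and the absolute label drops every choice except the first.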

5.4 Assigning class confidence values

The class confidence index, $\beta_{ij}$ ($1 \le i \le n$, $1 \le j \le m$, where m is the number of classes under consideration), denotes the ranking of the different experts ($\beta_{ij} = 1, 2, \ldots, n$) on a class-by-class basis (Rahman and Fairhurst [4]). The higher the class recognition rate, the higher the ranking. Once individual experts are evaluated on the evaluating database, the class confidence indices can be conveniently expressed in the form of a two dimensional array as follows:

\[
\begin{pmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{n1} & \beta_{n2} & \cdots & \beta_{nm}
\end{pmatrix}
\]
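A minimal sketch of how such a class confidence array might be filled from per-class recognition rates on the evaluation set is given below. Two points are assumptions made here, since the text does not fix them: rank 1 is given to the expert with the highest rate on a class, and ties are broken by expert order. The recognition rates are invented for illustration.

```python
def class_confidence_ranks(rates):
    """Build the n x m array of class confidence ranks.

    rates[i][j] is the recognition rate of expert i on class j, measured on
    the evaluation set. Assumption: rank 1 marks the best expert on a class,
    and ties are broken by expert order.
    """
    n = len(rates)        # number of experts
    m = len(rates[0])     # number of classes
    beta = [[0] * m for _ in range(n)]
    for j in range(m):
        # Experts ordered from best to worst on class j.
        order = sorted(range(n), key=lambda i: rates[i][j], reverse=True)
        for rank, i in enumerate(order, start=1):
            beta[i][j] = rank
    return beta


rates = [
    [0.90, 0.60, 0.75],   # hypothetical expert 1
    [0.85, 0.80, 0.70],   # hypothetical expert 2
]
beta = class_confidence_ranks(rates)
```

In this example the first expert ranks first on classes 1 and 3 and the second expert ranks first on class 2, which is the kind of class-specific strength that an overall recognition rate would hide.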

5.5 Assigning overall confidence values

The overall recognition rate of an expert is the average of all the individual recognition rates achieved on each individual class in the complete evaluation database. If n is the number of participating experts, the overall confidence index is expressed by $\gamma_k$, $1 \le k \le n$, representing the ranking of the experts ($\gamma_k = 1, 2, \ldots, n$) based on overall recognition rates (Rahman and Fairhurst [5]). The higher the recognition rate, the higher the ranking. Once individual experts are evaluated on the evaluating database, the overall confidence indices can be conveniently expressed as a one dimensional matrix in the form of:

\[
\begin{pmatrix}
\gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_n
\end{pmatrix}
\]

5.6 Quantifying the data diversity

The data diversity primarily measures the degree of closeness or similarity among the samples of different classes that are included in a target dataset. If the samples of the different classes are very dissimilar, the database has a rich data diversity; otherwise there is a lower degree of data diversity, which might represent cases where the different classes closely resemble each other. It is very important to measure this diversity, since it can be exploited as a very powerful source of information in designing algorithms to combine decisions by multiple experts. Here, an approach quantifying the closeness between a pair of classes is discussed for illustrative purposes. Although it is primarily targeted for a pair of classes, the approach is generalised and can be extended to any number of classes (Fairhurst and Rahman [6]).

A generalised method of generating the possible pairs of confusion among the classes of interest is now presented. The generalised confusion matrix $CM_{ij}$ can be expressed in the following form:

\[
CM_{ij} =
\begin{array}{c|ccccc}
 & c'_0 & c'_1 & c'_2 & \cdots & c'_{n-1} \\ \hline
c_0 & c_0c'_0 & c_0c'_1 & c_0c'_2 & \cdots & c_0c'_{n-1} \\
c_1 & c_1c'_0 & c_1c'_1 & c_1c'_2 & \cdots & c_1c'_{n-1} \\
c_2 & c_2c'_0 & c_2c'_1 & c_2c'_2 & \cdots & c_2c'_{n-1} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
c_{n-1} & c_{n-1}c'_0 & c_{n-1}c'_1 & c_{n-1}c'_2 & \cdots & c_{n-1}c'_{n-1}
\end{array}
\]

where the column matrix $(c_0\ c_1\ c_2\ \ldots\ c_{n-1})$ denotes the n classes under investigation and the row matrix $(c'_0\ c'_1\ c'_2\ \ldots\ c'_{n-1})$ denotes the identification labels assigned to the different classes which originally belong to the corresponding classes in the other matrix. In this representation, the diagonal matrix $(c_0c'_0\ c_1c'_1\ c_2c'_2\ \ldots\ c_{n-1}c'_{n-1})$ denotes the cases where the test samples are identified properly. Pairwise misclassification can be calculated by adding the entries which occupy mirror positions with respect to the diagonal. For example, $(c_1c'_0 + c_0c'_1)$ gives the confusion measure between the classes 0 and 1. In general, eqn. 1:

\[
\chi_{ij} = c_i c'_j + c_j c'_i \qquad (1)
\]

gives the confusion measure between any classes i and j. Normalising this value with respect to the total number of samples classified, either correctly or incorrectly, gives the normalised confusion measure expressed as in eqn. 2:

\[
\chi n_{ij} = \frac{\chi_{ij}}{\chi_{ij} + c_i c'_i + c_j c'_j} \qquad (2)
\]

This gives a series of $\chi n_{ij}$ values, and the matrix $\chi n_{ij}$, $i = 0, 1, \ldots, n$; $j = 0, 1, \ldots, n$, gives the similarity measures among all the classes under investigation. The similarity matrix is in turn normalised by setting the following criterion:

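[Eqn. 3: expression not legible in the source; the surviving fragments indicate a summation with upper limit n over index pairs with i > j]   (3)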

The normalised matrix is rearranged in decreasing order of magnitude. The possible pairs of classes which are candidates for re-evaluation are determined by applying the following criterion of approximation:

[Eqn. 4: expression not legible in the source]   (4)

where $\theta$ denotes a threshold value depending on the type of approximation demanded from the system.

The dissimilarity measures are easily obtained from the matrix $\chi n_{ij}$, $i = 0, 1, \ldots, n$; $j = 0, 1, \ldots, n$, by ordering the elements in increasing order and setting a corresponding boundary function. Although this framework focuses on the pairwise confusion matrix, confusion between any number of classes can easily be obtained from this matrix. When evaluating the dissimilar pairs, triplets etc., it is equally straightforward to find them from the matrix $\chi n_{ij}$; only this time the selection criterion is reversed.
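The pairwise confusion measures of eqns. 1 and 2 can be computed from a confusion matrix as in the sketch below. The function names and the example matrix are hypothetical. The normalisation and threshold criteria of eqns. 3 and 4 are not implemented; the sketch only orders the class pairs by their normalised confusion measure.

```python
def pairwise_confusion(cm):
    """Compute the confusion measures for every pair of classes.

    cm[i][j] is the number of samples of true class i labelled as class j.
    Returns {(i, j): (chi_ij, chi_n_ij)} for i < j, following eqns. 1 and 2.
    """
    n = len(cm)
    measures = {}
    for i in range(n):
        for j in range(i + 1, n):
            chi = cm[i][j] + cm[j][i]            # eqn. 1: mirror entries about the diagonal
            denom = chi + cm[i][i] + cm[j][j]    # eqn. 2: all samples assigned within the pair
            chi_n = chi / denom if denom else 0.0
            measures[(i, j)] = (chi, chi_n)
    return measures


def pairs_by_similarity(cm):
    """Class pairs ordered from most to least confused (decreasing chi_n_ij)."""
    measures = pairwise_confusion(cm)
    return sorted(measures, key=lambda pair: measures[pair][1], reverse=True)


cm = [
    [90, 8, 2],    # hypothetical counts for true class 0
    [6, 92, 2],    # hypothetical counts for true class 1
    [1, 3, 96],    # hypothetical counts for true class 2
]
measures = pairwise_confusion(cm)
ranking = pairs_by_similarity(cm)
```

Reading the same ordering from the other end gives the most dissimilar pairs, which corresponds to the reversed selection criterion mentioned in the text.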


5.7 Assessing data-specific information

As already mentioned, it is often possible to form special sub-groups or super-groups of classes to facilitate the decision combination process. Specific information about possible sub-group formation can be exploited by designing multiprototype classifiers in the framework of multiple expert decision combination (Rahman and Fairhurst [7]). On the other hand, information extracted about possible super-group formation can lead to splitting up a large problem into multiple smaller problems. For example, if it is known that some of the classes under consideration are more closely related to each other than other classes, it is often possible to exploit this information in designing the dataflow in such a way that the closely resembling classes are considered together and a super-group is formed. This often leads to the design of special purpose dedicated tailored algorithms targeted at particular classes only (Rahman and Fairhurst [8]). In general, this type of data specific information is often exploited in the way the overall decision combination process is designed.

6 Exploitation of extracted information

As already discussed, three different levels of information can be extracted from the test samples. This information can be in the form of absolute labels, as an ordered list of preferences or can be exact confidence indices related to the decisions of the experts. This whole array of information was collectively defined as first order information. It is also possible to extract additional information about the data diversity, the relative performance indices of the individual experts with respect to specific databases and specific domain information about the characteristics of a particular target dataset. These sources were collectively defined as generating second order information.

Based on the discussion presented so far, it is clear that decision combination of multiple experts needs to consider the different types of classifiers and their individual responses before a suitable combination scheme can be formulated. To design a really powerful and robust character recognition system, appropriate incorporation of first order and second order information is essential. A sensible approach in designing such multiple expert decision combination algorithms is therefore to build a knowledge-base consisting of first and second order information from the various sources, for example, the target database, the co-operating experts etc. Since successful exploitation of such a knowledge-base should indicate the degree of accuracy and robustness of the multiple expert combination scheme, it is helpful to investigate how and to what extent the available knowledge is built into the decision making process. The following Sections investigate some multiple expert decision combination approaches found in the literature in terms of this information management issue and explore how successful usage of this information is linked to the capability of these multiple expert approaches to deliver an optimised decision based on the decisions delivered by the co-operating classifiers.

6.1 Selected multiple expert decision combination methods

Some multiple expert decision combination methods have been selected to investigate the information management issues discussed in the context of multiple expert decision combination strategies as applied to the recognition of characters. These methods include the following:

the aggregation method (Ho et al. [9], Hull et al. [10]),

the ranking method (Mazurov et al. [11] and Ho et al. [9]),

the behaviour knowledge space method (Huang and Suen [12]),

the majority voting scheme (Kittler and Hatef [13]),

serial combination method (Rahman and Fairhurst [14]),

parallel combination method (Rahman and Fairhurst [3]), and

hybrid combination method (Rahman and Fairhurst [15]).

6.2 Selected independent experts

To compare the performances of different multiple expert configurations, it is important to have a group of experts which have comparable inter-expert performance indices, but which, at the same time, use different types of features and classification criteria. The following experts were chosen to provide a basis for the exploration of the operation of various integrated multiple expert systems.

Binary weighted scheme (BWS): This employs a technique based on n-tuple sampling or memory network processing (Fairhurst and Stonham [16]). The image array is divided into a certain number of samples, each consisting of a fixed number of pixels. Each of these samples is connected to a memory element, which, in turn, computes a single valued Boolean function.

Frequency weighted scheme (FWS): This is similar to the BWS, but in this case the memory elements calculate the relative frequencies of the sampled features, thereby indicating the probability distribution of the group of points or n-tuples (Fairhurst and Mattaso Maia [17]).

Multilayer perceptron network (MLP): This is the standard multilayer perceptron neural network structure, employing the standard error backpropagation algorithm (Rumelhart et al. [18]).

Moment-based pattern classifiers (MPC): These statistical algorithms make use of the nth order mathematical moments derived from the binarised patterns. Different discriminating functions may be used to identify possible cluster formation (Reiss [19]). Among these are the Euclidean distance, the Mahalanobis distance and the maximum likelihood discriminator.

6.3 Selected databases

Three databases have been chosen for all the experiments and simulations discussed here. Two of these databases contain handwritten characters and the third consists of machine printed characters. The first handwritten database is one compiled at the University of Essex, UK (database A) and some typical examples from this database are presented in Fig. 1 [20]. The second database (database B) is compiled by the US National Institute of Standards and Technology, and is popularly known as the NIST Database [21]. Some typical examples from this database are presented in Fig. 2. The third is compiled at the University of Kent at Canterbury, UK (database C) and some typical examples from this database are presented in Fig. 3 [22]. This is a compilation of printed characters and was extracted from machine printed post-codes supplied by the British Post Office. All these databases contain samples of alpha-numeric characters (the numerals 0 to 9 and upper case letters A to Z, with no distinction made between the characters '1/I' and '0/O').

Fig. 1 Some typical examples from reference handwritten database A

Fig. 2 Some typical examples from reference handwritten database B

Fig. 3 Some typical examples from reference machine-printed database C

Table 1: Performance comparison of the different experts on different databases

                              Optimum recognition rate (%)
Database   Type of expert   Digit classes   Digit + upper case letters
C          BWS              98.24           95.70
C          FWS              *98.48          *97.25
C          MPC              96.80           94.08
C          MLP              98.34           96.25
A          BWS              88.18           72.05
A          FWS              *92.36          79.39
A          MPC              90.50           80.00
A          MLP              92.17           *81.07
B          BWS              73.57           66.27
B          FWS              79.49           72.34
B          MPC              *86.29          *73.95
B          MLP              84.86           72.86

6.4 Relative overall performance

Since the objective of this paper is to explore possible relationships between the success of different multiple expert decision combination schemes and the extent of information usage by these combination approaches, it is useful to implement the selected multiple expert approaches and evaluate their performance on the chosen databases. Before an attempt is made to combine multiple experts using the various approaches considered in this paper, it is important that the performances of the individual experts on the selected databases are evaluated. Table 1 presents the performance of the various individual experts on the various databases. For database C, in the case of the numeric dataset, the best performance was achieved by the FWS network. The MLP network was the next best, closely followed by the BWS network. The MPC network performed the worst. In the case of the full alpha-numeric dataset, similar results were obtained. The FWS expert performed best, followed by the MLP and the BWS. In this case, the worst performance was again obtained by the MPC.

Table 1 also presents the performance of the different experts on database A. For the numeric dataset, the best performance was provided by the FWS classifier, followed by the MLP and the MPC, with the BWS providing the worst performance. For the complete alpha-numeric dataset, the best performance was provided by the MLP, followed by the MPC and the FWS. The BWS again performed worst.

Finally, Table 1 also includes the performance of the different experts on database B. Here, the best expert was the MPC in both the numeric and the alpha-numeric case. In both cases, the MPC was followed by the MLP and the FWS, and the BWS again provided the worst performance.
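The orderings described for the individual experts can be reproduced directly from the figures in Table 1. The following is a minimal sketch: the rates are transcribed from Table 1, while the names `TABLE_1`, `rank_experts` and `RANKINGS` are illustrative and do not come from the paper.

```python
# Recognition rates (%) transcribed from Table 1.
# Keys are (database, task); the dictionary layout is an illustrative choice.
TABLE_1 = {
    ("C", "digits"): {"BWS": 98.24, "FWS": 98.48, "MPC": 96.80, "MLP": 98.34},
    ("C", "alphanumeric"): {"BWS": 95.70, "FWS": 97.25, "MPC": 94.08, "MLP": 96.25},
    ("A", "digits"): {"BWS": 88.18, "FWS": 92.36, "MPC": 90.50, "MLP": 92.17},
    ("A", "alphanumeric"): {"BWS": 72.05, "FWS": 79.39, "MPC": 80.00, "MLP": 81.07},
    ("B", "digits"): {"BWS": 73.57, "FWS": 79.49, "MPC": 86.29, "MLP": 84.86},
    ("B", "alphanumeric"): {"BWS": 66.27, "FWS": 72.34, "MPC": 73.95, "MLP": 72.86},
}


def rank_experts(rates):
    """Return expert names ordered from highest to lowest recognition rate."""
    return sorted(rates, key=rates.get, reverse=True)


RANKINGS = {key: rank_experts(rates) for key, rates in TABLE_1.items()}

for (database, task), order in RANKINGS.items():
    print(f"Database {database}, {task}: {' > '.join(order)}")
```

No single expert is ranked first on every database, which is the situation that motivates combining the experts.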

Tables 2, 3 and 4 present the comparative performances of the various multiple expert approaches on the three databases respectively. It is observed that in the case of database C, the serial, parallel and hybrid approaches are better than all the other approaches when considering the digit classes only, while in the case of database A, this observation is not true. In this case, although the two stage combination approach is the best performing approach, the behaviour knowledge space method outperforms both the serial and the parallel approach. On the other hand, the serial and parallel approaches are better than the rest of the decision combination approaches. Similar observations are also true for the case of database B. In this case, for the digit classes, the behaviour knowledge space method is better than the serial approach, but performs worse than the parallel approach. For the digits plus upper case letters, the behaviour knowledge space method is better than both the serial and the parallel approach.

In general, it has been found that the behaviour knowledge space method is the best performing approach among the first four selected methods, which have been reported by various researchers. The ranking method is the next best performing method, followed by the majority voting scheme and the aggregation method. Among the last three methods, the hybrid method is the best, followed by the parallel method and the serial method. When these methods are compared with the other selected methods, the serial method is sometimes worse than some of the selected methods, but in most cases the parallel method is better than all the selected methods except the behaviour knowledge space method. Undoubtedly, however, the hybrid method employing a two stage implementation is always better than all the selected methods. Other hybrid methods employing a single stage configuration also outperform the selected methods in most cases, but not in all cases.

Table 2: Performance comparison among different configurations on Database C

                                  Best version                               Overall performance (%)
Type of algorithm                 Digit classes     Digits plus upper case   Digit classes  Digits plus upper case
Aggregation method                MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.66          97.53
Ranking method                    MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.73          97.92
Behaviour knowledge space method  MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.81          98.01
Majority voting scheme            MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.71          97.75
Serial                            MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.99          98.21
Parallel                          MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          98.98          97.98
Hybrid single stage               MPC-MLP-MLP-FWS   MPC-MLP-MPC-FWS          99.06          98.16
Hybrid two stage                  MPC-MLP-MPC-FWS   MPC-MLP-MPC-FWS          99.14          98.61

Table 3: Performance comparison among different configurations on Database A

                                  Best version                               Overall performance (%)
Type of algorithm                 Digit classes     Digits plus upper case   Digit classes  Digits plus upper case
Aggregation method                MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          92.92          81.85
Ranking method                    MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          94.11          82.51
Behaviour knowledge space method  MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          94.56          83.86
Majority voting scheme            MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          93.42          82.34
Serial                            BWS-MPC-MLP-FWS   BWS-FWS-MPC-MLP          93.31          82.77
Parallel                          BWS-MPC-MLP-FWS   BWS-MPC-MLP-FWS          94.43          84.72
Hybrid single stage               MPC-MLP-MLP-FWS   MPC-MLP-MLP-FWS          93.41          83.19
Hybrid two stage                  MPC-MLP-MLP-FWS   MPC-MLP-MLP-FWS          96.8           84.91

Table 4: Performance comparison among different configurations on Database B

                                  Best version                               Overall performance (%)
Type of algorithm                 Digit classes     Digits plus upper case   Digit classes  Digits plus upper case
Aggregation method                MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          87.56          77.92
Ranking method                    MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          89.11          79.12
Behaviour knowledge space method  MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          89.34          79.44
Majority voting scheme            MPC-BWS-MLP-FWS   MPC-BWS-MLP-FWS          88.84          78.29
Serial                            BWS-FWS-MLP-MPC   BWS-FWS-MLP-MPC          88.41          76.02
Parallel                          BWS-FWS-MLP-MPC   BWS-FWS-MLP-MPC          90.91          79.18
Hybrid single stage               FWS-MLP-MLP-MPC   FWS-MLP-MLP-MPC          92.10          81.43
Hybrid two stage                  FWS-MLP-MLP-MPC   FWS-MLP-MPC-MPC          92.31          82.19
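The general ordering described above can be checked against the figures in Tables 2 to 4. The following is a minimal sketch under one stated assumption: each method is summarised by the unweighted mean of its six reported rates (three databases, two tasks). That summary is an illustrative choice made here, not a measure used by the authors, and the names `RATES`, `MEAN_RATE` and `ORDER` are hypothetical.

```python
from statistics import mean

# Overall recognition rates (%) transcribed from Tables 2, 3 and 4.
# Column order: C digits, C alphanumeric, A digits, A alphanumeric,
# B digits, B alphanumeric.
RATES = {
    "Aggregation": [98.66, 97.53, 92.92, 81.85, 87.56, 77.92],
    "Ranking": [98.73, 97.92, 94.11, 82.51, 89.11, 79.12],
    "Behaviour knowledge space": [98.81, 98.01, 94.56, 83.86, 89.34, 79.44],
    "Majority voting": [98.71, 97.75, 93.42, 82.34, 88.84, 78.29],
    "Serial": [98.99, 98.21, 93.31, 82.77, 88.41, 76.02],
    "Parallel": [98.98, 97.98, 94.43, 84.72, 90.91, 79.18],
    "Hybrid single stage": [99.06, 98.16, 93.41, 83.19, 92.10, 81.43],
    "Hybrid two stage": [99.14, 98.61, 96.80, 84.91, 92.31, 82.19],
}

# Assumption: an unweighted mean over the six experiments is used purely
# as a convenient summary for ordering the methods.
MEAN_RATE = {method: mean(values) for method, values in RATES.items()}
ORDER = sorted(MEAN_RATE, key=MEAN_RATE.get, reverse=True)

for method in ORDER:
    print(f"{method:28s} {MEAN_RATE[method]:.2f}")
```

Under this summary the four previously reported methods fall in the order given in the text (behaviour knowledge space, ranking, majority voting, aggregation), and the two stage hybrid configuration comes first overall.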

It is observed from the results that decision combination always produces a performance enhancement. The degree of enhancement depends on the decision combination strategy adopted. It is also observed that the enhanced performance is in no way dependent on the modification of the experts themselves. Rather, it is the way in which the individual decisions are combined which is the most important factor. All these decision combination strategies implement some sort of algorithm to arrive at a consensus. In the case of some decision combination strategies, the decision combiner is a passive expert. In the case of the hybrid decision combination schemes presented here, however, there is no passive decision combiner. Rather, all the experts in the hierarchy are active and deliver some type of decision. That is why these structures are completely active structures and the choice of a particular sequence of experts is of paramount interest. It is seen that, although all the combination schemes offer performance enhancement, the hybrid schemes have a significant advantage over the other schemes in terms of flexibility of design and modularity. The experts at the different hierarchical positions can be easily exchanged with some other expert appearing elsewhere in the hierarchy, and such a modification can lead to significant changes in the behaviour of the combined classification scheme. In this sense, effort has to be directed to achieve an overall system optimisation.
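The observation that combination always produces an enhancement can be made concrete by comparing each combined rate in Tables 2 to 4 with the best individual expert for the same database and task in Table 1. The following is a minimal sketch: the figures are transcribed from the tables, while `METHODS`, `EXPERIMENTS`, `gains_over_best_expert` and `GAINS` are hypothetical names introduced for illustration.

```python
# Row order used in Tables 2 to 4.
METHODS = [
    "Aggregation", "Ranking", "Behaviour knowledge space", "Majority voting",
    "Serial", "Parallel", "Hybrid single stage", "Hybrid two stage",
]

# For each experiment: (best individual expert rate from Table 1,
# combined rates from Tables 2 to 4 in METHODS order). All values in %.
EXPERIMENTS = {
    "C digits": (98.48, [98.66, 98.73, 98.81, 98.71, 98.99, 98.98, 99.06, 99.14]),
    "C alphanumeric": (97.25, [97.53, 97.92, 98.01, 97.75, 98.21, 97.98, 98.16, 98.61]),
    "A digits": (92.36, [92.92, 94.11, 94.56, 93.42, 93.31, 94.43, 93.41, 96.80]),
    "A alphanumeric": (81.07, [81.85, 82.51, 83.86, 82.34, 82.77, 84.72, 83.19, 84.91]),
    "B digits": (86.29, [87.56, 89.11, 89.34, 88.84, 88.41, 90.91, 92.10, 92.31]),
    "B alphanumeric": (73.95, [77.92, 79.12, 79.44, 78.29, 76.02, 79.18, 81.43, 82.19]),
}


def gains_over_best_expert(best_single, combined):
    """Percentage-point gain of each combination over the best single expert."""
    return {m: round(rate - best_single, 2) for m, rate in zip(METHODS, combined)}


GAINS = {name: gains_over_best_expert(*data) for name, data in EXPERIMENTS.items()}

for name, gains in GAINS.items():
    best = max(gains, key=gains.get)
    print(f"{name}: smallest gain {min(gains.values()):+.2f}, "
          f"largest gain {gains[best]:+.2f} ({best})")
```

Every gain is positive in the transcribed figures, and the largest gain in each experiment belongs to the two stage hybrid configuration.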

6.5 Extent of information usage

It was suggested earlier that the efficient usage of extracted information collected from various sources might be a very strong indicator of how efficiently a multiple expert decision combination approach would perform in practical situations. The discussion presented in the previous Section has evaluated the comparative success of the selected multiple expert decision combination approaches on the chosen databases. In this Section, the issues concerning efficient exploitation of the collected information in relation to the relative success of these decision combination frameworks are explored.

Table 5 identifies the first order information sources that can be extracted from the test samples. It shows that the aggregation method utilises two, the ranking method all three, the behaviour knowledge space method one and the majority voting scheme one of the three possible sources of information. The aggregation method uses absolute labels and sample confidence indices, the ranking method uses absolute labels, preference lists and sample confidence indices and the behaviour knowledge space method uses only absolute labels.

On the other hand, Table 6 identifies the second order information sources that can be extracted from various secondary sources. In this case, the aggregation method uses two, the ranking method two, the behaviour knowledge space method three and the majority voting scheme only one of the four possible sources of information. In comparison, the hybrid method utilises all four of these information sources. This demonstrates the extent of incorporation of secondary information in the decision making process by this hybrid method.

In the series combination approach, two out of the three possible first order information sources and one of the four possible second order information sources have been exploited. It utilises absolute labels and preference lists to determine how the information is transferred within the various experts. On the other hand, secondary information about relative overall strengths of the experts is used in deciding the hierarchy of the experts (i.e. the order in which they appear in the serialised framework).

The parallel combination approach, by comparison, exploits all three possible sources of first order information and two of the four possible sources of second order information. All of this information is exploited when combining the individual experts which play a role in the decision making process. Although the decision hierarchy is able to exploit all these various sources of information, it does not use the two vital sources of second order information which relate to the information about the target data diversity and the specific class-directed considerations respectively.

The hybrid approach uses information about absolute labels, but does not use the other two sources of first order information. However, it uses all four possible sources of second order information. This approach is very different from the approaches taken by the serialised and parallel schemes. Since it adopts a relatively sophisticated class-directed approach based on second order information extracted from the data diversity and various relative performance indices, it is able to exploit a substantial proportion of the first order and second order information that is available from the various sources.

Overall, it is seen that some of the multiple expert methods considered are very efficient in incorporating first and second order information in their decision hierarchy, and some are much less so. It is also evident, however, that there is a direct relationship between the exploitation of the available information sources and the overall success of the decision combination frameworks. It is found that decision combination strategies exploiting additional relevant information have greater success in delivering more robust performances. Hence it is entirely possible to treat the comparative evaluation of multiple expert decision combination approaches based on their capability of exploiting first and second order information extracted from the various sources as a yardstick in estimating the level of performance that is achievable from these combined configurations. It is evident from the discussion so far that the hybrid approach is by far the best combination scheme in terms of exploiting the important sources of information, followed by the parallel approach, the behaviour knowledge space method and the ranking method, with the remaining approaches likely to be generally less effective.

Table 5: Comparison among different configurations with respect to the selected methods: exploitation of extracted first order information

                                  First order information
Type of combination approach      Absolute labels  Preference lists  Sample confidence indices α
Aggregation                       ✓                -                 ✓
Ranking                           ✓                ✓                 ✓
Behaviour knowledge space method  ✓                -                 -
Majority voting                   ✓                -                 -
Serial                            ✓                ✓                 -
Parallel                          ✓                ✓                 ✓
Hybrid                            ✓                -                 -

Table 6: Comparison among different configurations with respect to the selected methods: exploitation of extracted second order information

                                  Second order information
                                  Relative strengths of experts              Data diversity           Specific class consideration
Type of combination approach      Class confidence β  Overall confidence γ   (e.g. confusion matrix)  (e.g. tailored algorithms)
Aggregation                       ✓                   ✓                      -                        -
Ranking                           ✓                   ✓                      -                        -
Behaviour knowledge space method  ✓                   ✓                      ✓
Majority voting                   -                   -
Serial                            -                   ✓                      -                        -
Parallel                          ✓                   ✓                      -                        -
Hybrid                            ✓                   ✓                      ✓                        ✓
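The counts of information sources quoted in this Section can be gathered into a single summary. The following is a minimal sketch: the counts are those stated in the text (three first order and four second order sources are available), while the unweighted total and the names `SOURCES_USED`, `usage_summary` and `SUMMARY` are assumptions introduced here for illustration. The authors do not define a numerical score of this kind.

```python
FIRST_ORDER_AVAILABLE = 3   # absolute labels, preference lists, sample confidence
SECOND_ORDER_AVAILABLE = 4  # class and overall strengths, data diversity, class-specific

# (first order sources used, second order sources used), as stated in the text.
SOURCES_USED = {
    "Aggregation": (2, 2),
    "Ranking": (3, 2),
    "Behaviour knowledge space": (1, 3),
    "Majority voting": (1, 1),
    "Serial": (2, 1),
    "Parallel": (3, 2),
    "Hybrid": (1, 4),
}


def usage_summary(first, second):
    """Summarise how much of the available information one approach uses."""
    return {
        "first_order": first,
        "second_order": second,
        "total": first + second,  # assumption: sources are weighted equally
        "second_order_fraction": second / SECOND_ORDER_AVAILABLE,
    }


SUMMARY = {name: usage_summary(*counts) for name, counts in SOURCES_USED.items()}

for name, row in sorted(SUMMARY.items(),
                        key=lambda item: item[1]["second_order"], reverse=True):
    print(f"{name:28s} first order {row['first_order']}/{FIRST_ORDER_AVAILABLE}, "
          f"second order {row['second_order']}/{SECOND_ORDER_AVAILABLE}")
```

A plain count treats every source as equally valuable, so it should be read only as a bookkeeping aid alongside the qualitative comparison made in the text.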

7 Discussion

It has been demonstrated that efficient usage of extracted information is a very powerful indicator of the level of performance achievable from a multiple expert combination configuration. It is also apparent that the important factor in the success of such combination algorithms is how effectively the extracted information is exploited in the decision making process and how much of this information is reflected in designing the detail of the combined configuration. It is clear, however, that the issue of information incorporation is an important factor in dictating the structural complexity of various multiple expert decision combination approaches, since they either implicitly or explicitly reflect the incorporation of extracted a priori information in the decision combination framework. Structural complexity, on the other hand, is directly associated with the number of processing elements that a particular configuration employs in the decision making process.

The extent of information extraction and its incorporation in the combined decision making process is therefore reflected in how many active and passive processing elements are employed by a decision combination framework, but it is also understood that employment of additional processing elements can be a powerful indicator of the ordering of the decision combination process. This concept can be easily extended by assuming that the incorporation of first and especially second order information is reflected not only in the number of processing elements, but more so in how the physical structure of the combined framework is designed. The issues concerning the complexity and ordering of the physical layout can be investigated by analysing the various multiple expert decision combination approaches in terms of a decision combination topology previously reported [23]. It is seen that each of the selected multiple expert decision combination strategies has a different level of complexity associated with its physical structure and with how the various experts communicate with each other. It is also seen that, as more additional information is incorporated in the decision making process, the structural complexity also increases, as is evident from evaluating the hybrid method and the behaviour knowledge space method. This demonstrates that the design of multiple expert classifier configurations can be streamlined by classifying these structures in terms of how the channels used for carrying information among different experts are interconnected, irrespective of the algorithms used by the co-operating experts and by the final decision combination expert, which is a direct consequence of the incorporation of additional a priori information.

8 Conclusions

This paper has presented an in-depth investigation of the implications of exploiting a priori information from various sources and incorporating that information in multiple expert decision combination strategies. Incorporation of the extracted information plays a pivotal role in the success of any multiple expert decision combination strategy. It has been demonstrated that it is possible to successfully incorporate this information in the design of classifier combination strategies and the implications of this have been investigated in detail. This leads to modifications in the physical structure, the design of interconnecting information exchange pathways, the number of processing elements, the throughput characteristics and, above all, the overall recognition performances achievable from these combined configurations. Seven different multiple expert decision combination frameworks have been studied in terms of these characteristics. It has been demonstrated that incorporation of additional a priori information can lead to very successful multiple expert decision combination algorithms, and that the extent of available information usage is a very good indicator of the level of performance that is achievable from these decision combination frameworks. This observation leads to the conclusion that, before designing new multiple expert decision combination strategies, it is of paramount importance to evaluate the extent to which possible sources of information are exploited in the decision making process, since the success of a new decision combination framework would certainly depend on successfully incorporating this information.

9 References

1 XU, L., KRZYZAK, A., and SUEN, C.Y.: ‘Methods of combining multiple classifiers and their applications to handwriting recognition’, IEEE Trans. Syst. Man Cybern., 1992, 23, (3), pp. 418-435

2 HO, T.K., HULL, J.J., and SRIHARI, S.N.: ‘On multiple classifier systems for pattern recognition’. Proceedings of 11th ICPR, The Hague, Netherlands, 1992, pp. 84-87

3 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘Exploiting second order information to design a novel multiple expert decision combination platform for pattern classification’, Electron. Lett., 1997, 33, (6), pp. 476-477

4 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘An evaluation of multi-expert configurations for recognition of handwritten numerals’, Patt. Recog., 1998, 31, (9), pp. 1255-1273

5 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘Machine-printed character recognition revisited: re-application of recent advances in handwritten character recognition research’, Image Vis. Comp., 1998, 16, (12, 13), pp. 819-842 (Special issue on Document image processing and multimedia environments)

6 FAIRHURST, M.C., and RAHMAN, A.F.R.: ‘A new multi-expert architecture for high performance object recognition’, Proc. SPIE - Int. Soc. Opt. Eng., 1996, 2908, pp. 140-151

7 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘Multi-prototype classification: Improved modelling of the variability of handwritten data using statistical clustering algorithms’, Electron. Lett., 1997, 33, (14), pp. 1208-1209

8 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘Selective partition algorithm for finding regions of maximum pairwise dissimilarity among statistical class models’, Patt. Recog. Lett., 1997, 18, (7), pp. 605-611

9 HO, T.K., HULL, J.J., and SRIHARI, S.N.: ‘Decision combination in multiple classifier systems’, IEEE Trans. Pattern Anal. Mach. Intell., 1994, 16, (1), pp. 66-75

10 HULL, J.J., COMMIKE, A.A., and HO, T.K.: ‘Multiple algorithms for handwritten character recognition’. Proceedings of 1st international workshop on Frontiers in handwriting recognition, Montreal, Canada, 1990, pp. 117-124

11 MAZUROV, V.D., KRIVONOGOV, A.I., and KAZANTSEV, V.L.: ‘Solving of optimisation and identification problems by the committee methods’, Patt. Recog., 1987, 20, (4), pp. 371-378

12 HUANG, Y.S., and SUEN, C.Y.: ‘A method of combining multiple experts for the recognition of unconstrained handwritten numerals’, IEEE Trans. Pattern Anal. Mach. Intell., 1995, 17, (1), pp. 90-94

13 KITTLER, J., and HATEF, M.: ‘Improving recognition rates by classifier combination’. Proceedings of the 5th international workshop on Frontiers of handwriting recognition, 1996, pp. 81-102

14 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘A multiple-expert decision combination strategy for handwritten character recognition’. Proceedings of the international conference on Computational linguistics, speech and document processing, Calcutta, India, 1998, pp. A23-A28

15 FAIRHURST, M.C., and RAHMAN, A.F.R.: ‘A generalised approach to the recognition of structurally similar handwritten characters’, IEE Proc., Vis. Image Signal Process., 1997, 144, (1), pp. 15-22

16 FAIRHURST, M.C., and STONHAM, T.J.: ‘A classification system for alphanumeric characters based on learning network techniques’, Digit. Process., 1976, 2, pp. 321-339

17 FAIRHURST, M.C., and MATTASO MAIA, M.A.G.: ‘Performance comparison in hierarchical architectures for memory network pattern classifiers’, Pattern Recog. Lett., 1986, 4, (2), pp. 121-124

18 RUMELHART, D.E., HINTON, G.E., and WILLIAMS, R.J.: ‘Learning internal representations by error propagation’, in RUMELHART, D.E., and MCCLELLAND, J.L. (Eds.): ‘Parallel distributed processing’ (MIT Press, Cambridge, MA, 1986), vol. 1, pp. 318-362

19 REISS, T.H.: ‘Recognizing planar objects using invariant image features’ (Springer-Verlag, Berlin, Heidelberg, Germany, 1993)

20 LUCAS, S., and AMIRI, A.: ‘Statistical syntactic methods for high-performance OCR’, IEE Proc., Vis. Image Signal Process., 1996, 143, (1), pp. 23-30

21 NIST: ‘Special databases 1-3, 6-8, 19, 20’ (National Institute of Standards and Technology, Gaithersburg, MD 20899, USA)

22 Image Processing Computer Vision: Electronic Engineering, University of Kent, Canterbury CT2 7NT, UK

23 RAHMAN, A.F.R., and FAIRHURST, M.C.: ‘Introducing new multiple expert decision combination topologies: a case study using recognition of handwritten characters’. Proceedings of the international conference on Document analysis and recognition, ICDAR97, Ulm, Germany, 1997, Vol. 2, pp. 886-891
