
RESEARCH ARTICLE

A method for mineral prospectivity mapping integrating C4.5 decision tree, weights-of-evidence and m-branch smoothing techniques: a case study in the eastern Kunlun Mountains, China

Cuihua Chen & Binbin He & Ze Zeng

Received: 9 February 2013 / Accepted: 16 July 2013 / Published online: 17 August 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract In this study, a novel method that integrates C4.5 decision tree, weights-of-evidence and m-branch smoothing techniques was proposed for mineral prospectivity mapping. First, a weights-of-evidence model was used to rank the importance of each evidential map and determine the optimal buffer distance. Second, a classification technique that uses a C4.5 decision tree in data mining was used to construct a decision tree classifier for the grid dataset. Finally, an m-branch smoothing technique was used as a predictor, which transformed the decision tree into a probability evaluation tree. The method makes no conditional independence assumption and can be applied for class imbalanced datasets like those collected during mineral exploration for prospectivity mapping of an area. The traits of comprehensibility, accuracy and efficiency were derived from the C4.5 decision tree. In addition, a case study for iron prospectivity mapping was performed in the eastern Kunlun Mountains, China. Sixty-two Skarn iron deposits and eight evidential maps related to iron mineralization were studied. On the final map, areas of low, moderate and high potential for iron deposit occurrence covered areas of 71,491, 14,298, and 9,532 km², respectively. For the goodness-of-fit test, 91.94 % of the total 62 iron deposits were within a high-potential area, 8.06 % were within a moderate-potential area and 0 % were within a low-potential area. For ten-fold cross-validation, 82.26 % were within a high-potential area, 14.52 % were within a moderate-potential area and 3.22 % were within a low-potential area. To evaluate the predictive accuracy, Receiver Operating Characteristic (ROC) curves and Area Under ROC Curve (AUC) were employed. The accuracy of the goodness-of-fit test reached 97.07 %, and the accuracy of the ten-fold cross-validation was 95.10 %. The majority of the iron deposits were within high-potential and moderate-potential areas, which covered a small proportion of the study area.

Keywords Eastern Kunlun Mountains . Mineral prospectivity mapping . C4.5 decision tree . M-branch smoothing . Weights-of-evidence model

Introduction

In recent years, mineral prospectivity analysis and quantitative resource estimation have been recognized as important when integrating multi-source geology spatial data (Porwal and Kreuzer 2010). The statistical and mathematical approaches developed recently for multi-source geological spatial data integration include: weights-of-evidence (WofE) (Bonham-Carter et al. 1989; Agterberg 1992; Harris et al. 2000; Carranza 2004; Daneshfar et al. 2006; He et al. 2010a, b; Porwal et al. 2010), logistic regression (Agterberg et al. 1993; Carranza and Hale 2001), fuzzy logic (Aminzadeh 1994; Luo and Dimitrakopoulos 2003), artificial neural networks (Koike et al. 2002; Rigol-Sanchez et al. 2003) and the fractal method (Gumiel et al. 2010). Recently, data mining and machine learning techniques were introduced into mineral prospectivity mapping, including a priori association rules mining (Cui et al. 2010; He et al. 2011), case-based reasoning (He et al. 2012), and support vector machines (SVMs) (Zuo and Carranza 2011; Abedi et al. 2012).

Communicated by: H. A. Babaie

C. Chen
College of Geosciences, Chengdu University of Technology, Chengdu, Sichuan 610059, China

B. He (*) · Z. Zeng
School of Resources and Environment, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
e-mail: [email protected]

Earth Sci Inform (2014) 7:13–24
DOI 10.1007/s12145-013-0128-0

Among all the data mining tools, the C4.5 decision tree has been the most commonly used due to its transparency, accuracy and efficiency (Wu et al. 2008). It has become a benchmark to which newer supervised learning algorithms are often compared. The C4.5 decision tree has been used for classification (returns a discrete type of output class) in many application areas but has rarely been used for prediction (returns continuous values). Hwang et al. (2009) modeled landslide prediction as a two-class classifier. They employed the C4.5 technique to classify a trained dataset into two categories, landslide and no landslide, to predict landslide locations. In fact, Hwang’s work focused on classifying the study areas, not predicting, because we cannot know which same-category locations are at a greater risk of experiencing future landslides. A probability-based ranking of a smoothing technique can transform a constructed Decision Tree Classifier (DTC) to a Probabilistic Estimation Tree (PET). Then, predictions can be made by the PET. Yeon et al. (2010) applied decision trees with smoothing techniques for mapping landslide susceptibility and achieved acceptable results.

The objective of this paper is to describe a novel method of developing mineral prospectivity maps that integrate C4.5 decision tree, weights-of-evidence and m-branch smoothing techniques. The decision tree was employed to overcome the shortcomings of the statistical assumptions required by the weights-of-evidence and logistic techniques. The m-branch smoothing technique was used to transform the decision tree into a probabilistic estimation tree. The weights-of-evidence model was used to rank the importance of each evidential map and determine the optimal buffer distance for the various geological predictive maps.

The main advantages of the proposed methodology are as follows: (1) Unlike other statistical methods, the decision tree technique makes no conditional independence assumption; consequently, better predictive accuracy can be expected, and more evidential maps related to mineralization can be selected for modeling mineral prospectivity. (2) The methodology is suitable for the class-imbalanced datasets that are common in mineral exploration. The use of the SVM method for prospectivity mapping is limited as a uniform distribution of positive and negative class data is assumed, which is not usually the case in mineral exploration because areas without deposits are much larger than those with deposits. A decision tree overcomes this problem, allowing global mineral prospectivity maps to be developed.

Geological characteristics of Skarn iron deposits in the study area

The eastern Kunlun Mountains, in Qinghai Province, China, which cover 95,321 km², were chosen as the study area. It is located between the meridians 90°31′E and 100°04′E and the parallels 34°57′N and 37°56′N. Sixty-two Skarn iron deposits were selected from a database of 81 iron deposits for the experiment. The location of the eastern Kunlun Mountains is shown in Fig. 1, as well as the distribution of known Skarn iron deposits, which are denoted as black round points.

The selected 62 skarn iron deposits (Fig. 1) are spatially associated with alkaline series intrusions that usually have high concentrations of potassium and calcium. The iron mineralization occurs in carbonate, silicate, altered marble, as well as skarn altered marble from lithologies belonging to the Ordovician Tanjianshan group. Ore bodies of iron are found within contact zones between intrusive rock, especially the Variscan-Indosinian moyite, and the Tanjianshan Group. The main host rocks are a series of basic mafic volcanic rocks, carbonate and clastic rocks from the Ordovician Tanjianshan Group. The main lithologies are generally composed of sandstone, argillite and marble. The main alteration of the host rock includes silicification, carbonate alteration or iron skarn mineral assemblages. The main ore-forming period occurred during the Indosinian-Early Yanshanian and was related to hydrothermal alteration associated with igneous activity. The main controls on mineralization are consequently intrusions, carbonate host rocks and northwest faults, including stratigraphically controlled fractures.

The following evidential datasets were used for Skarn iron deposit modelling based on the understanding of Skarn iron formation as well as the general geological characteristics of mineralization in eastern Kunlun described above. Extensive region-wide felsic igneous intrusive activity has occurred since the Proterozoic. This intrusive activity, when spatially associated with carbonate-rich host rocks, has controlled the formation of iron mineralization through hydrothermal processes, lithological controls and structural controls. Indosinian k-feldspar granite is a major component of the control on mineralization, providing heat, hydrothermal fluids and metal. Thus, the granite is one type of evidence. Regional faults are one of the most important features that subsequently influence hydrothermal activity and ore formation. Thus, investigation of the relationship between faults and iron mineralization is another step in understanding the mineralization process and the distribution of hydrothermal or iron deposits. The igneous intrusions that control mineralization may themselves be spatially associated with faults or crustal discontinuities. Pooling of a large volume of magma from depths far beneath the Earth’s crust in shallow crustal magma chambers may result in the formation of hydrothermal mineral deposits, and ultimately affect the potential for ore formation. Therefore, fault lines of various sizes were used to test their spatial relationships to iron mineralization. Skarn iron deposits are also associated with carbonate-rich host rocks, which, when altered by hydrothermal fluids from the granites, host skarn iron mineralization. Consequently, the relationship between carbonate lithologies in the Tanjianshan group and iron skarn mineralization was also investigated, as these lithologies are an important predictor. The measurement of the physical properties of magnetism and gravity of the rocks in the study area may provide techniques to directly detect skarn iron mineralization. Aeromagnetic data measure the magnetic field of various rock types and Bouguer gravity data measure density variations of the mass near the surface and/or beneath the surface. In general, both gravity and magnetic data can be spatially associated with skarn iron mineralization. Thus, aeromagnetic and gravity data over the study area were analyzed to understand their spatial associations with the metallogenic environments and skarn iron deposits.

Method

The proposed method includes three key techniques: weights-of-evidence, a C4.5 decision tree and m-branch smoothing. The weights-of-evidence technique was used to rank the importance of each evidential map and determine the optimal buffer distance for the NW and EW fault lines, fault intersections, Indosinian k-feldspar granite and Tanjianshan group. The C4.5 decision tree was used to construct a decision tree classifier to classify the grid dataset. The m-branch smoothing technique transformed the decision tree into a probability evaluation tree.

Contrast coefficient of weights-of-evidence

Weights-of-evidence is a statistical method that was first developed for medical diagnoses. Bonham-Carter et al. (1989) adapted it for mineral prospectivity mapping, and it is now the most widely used method in mineral prospectivity mapping applications. A certain grid size is chosen to divide the study area (corresponding to every evidential map) into a regular grid (or unit cells). For this analysis it is assumed that each grid cell contains at most one mineral deposit. Assuming that the total number of cells in the study area is T and the number of cells containing a mineral deposit is M, then the a priori probability that a random grid cell will contain a mineral deposit is P(d) = M/T, and the a priori odds are defined as follows:

$$O(d) = \frac{P(d)}{1 - P(d)} \tag{1}$$

For each evidential map, let $e_j$ denote the cells in which the j-th evidence is present and $\neg e_j$ the cells in which it is absent. The evidence weights for the j-th evidential map are expressed as follows:

$$W_j^{+} = \ln \frac{O(d \mid e_j)}{O(d)} \tag{2}$$

Fig. 1 Location of the eastern Kunlun Mountains within Qinghai Province, China. The 62 known Skarn iron deposits are shown as black dots


$$W_j^{-} = \ln \frac{O(d \mid \neg e_j)}{O(d)} \tag{3}$$

The contrast coefficient can be defined as follows:

$$C_j = W_j^{+} - W_j^{-} \tag{4}$$

which indicates the strength of spatial association between the evidential maps and the known mineral deposits (training data). The larger $C_j$ is, the stronger the correlation between the evidential maps and the known mineral deposits is. Thus, $C_j$ could be used to rank the relative importance of each layer for predicting mineralisation. For each evidential map, the variance of weights can be calculated as follows:

$$\sigma^2(W_j^{+}) = \frac{1}{N(e_j \cap d)} + \frac{1}{N(e_j \cap \neg d)} \tag{5}$$

$$\sigma^2(W_j^{-}) = \frac{1}{N(\neg e_j \cap d)} + \frac{1}{N(\neg e_j \cap \neg d)} \tag{6}$$

where ∩ is the intersection operator, $N(e_j \cap d)$ is the number of cells in which the j-th evidence and a deposit both exist, and $N(e_j \cap \neg d)$ is the number of cells in which the evidence exists but there is no deposit. The variance of the j-th evidential map’s contrast can be calculated as follows:

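$$\sigma^2(C_j) = \sigma^2(W_j^{+}) + \sigma^2(W_j^{-}) \tag{7}$$

Thus, the statistic Stud(C), the contrast divided by its standard deviation, is defined as follows:

$$\mathrm{Stud}(C) = \frac{C}{\sigma(C)} \tag{8}$$

From formulas (2) and (3), the contrast $C_j$ can be transformed to

$$C_j = \ln \frac{O(d \mid e_j)}{O(d \mid \neg e_j)} \tag{9}$$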

Applying the a priori odds formula (1), the formula above is turned into the following:

$$C_j = \ln \frac{P(d \mid e_j)\,\bigl(1 - P(d \mid \neg e_j)\bigr)}{\bigl(1 - P(d \mid e_j)\bigr)\,P(d \mid \neg e_j)} = \ln \frac{P(d \mid e_j)\,P(\neg d \mid \neg e_j)}{P(\neg d \mid e_j)\,P(d \mid \neg e_j)} = \ln \frac{N(d \cap e_j)\,N(\neg d \cap \neg e_j)}{N(\neg d \cap e_j)\,N(d \cap \neg e_j)} \tag{10}$$

Although all the spatial statistics calculated by the weights-of-evidence technique are discussed here, only the Stud(C) statistic was used in our study.
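The statistics of formulas (1) to (8) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, argument names and the example counts are hypothetical, and every one of the four cell counts is assumed to be non-zero.

```python
import math

def wofe_statistics(n_ed, n_e_nd, n_ne_d, n_ne_nd):
    """Weights, contrast and studentized contrast for one binary evidential map.

    n_ed    : cells with evidence present and a deposit     N(e ∩ d)
    n_e_nd  : cells with evidence present and no deposit    N(e ∩ ¬d)
    n_ne_d  : cells with evidence absent and a deposit      N(¬e ∩ d)
    n_ne_nd : cells with evidence absent and no deposit     N(¬e ∩ ¬d)
    """
    total = n_ed + n_e_nd + n_ne_d + n_ne_nd
    deposits = n_ed + n_ne_d
    prior_odds = deposits / (total - deposits)           # Eq. (1) with P(d) = M/T
    odds_given_e = n_ed / n_e_nd                          # O(d | e)
    odds_given_not_e = n_ne_d / n_ne_nd                   # O(d | ¬e)
    w_plus = math.log(odds_given_e / prior_odds)          # Eq. (2)
    w_minus = math.log(odds_given_not_e / prior_odds)     # Eq. (3)
    contrast = w_plus - w_minus                           # Eq. (4)
    var_plus = 1.0 / n_ed + 1.0 / n_e_nd                  # Eq. (5)
    var_minus = 1.0 / n_ne_d + 1.0 / n_ne_nd              # Eq. (6)
    stud = contrast / math.sqrt(var_plus + var_minus)     # Eqs. (7) and (8)
    return w_plus, w_minus, contrast, stud

# Hypothetical counts for one buffered map (invented for illustration only).
w_plus, w_minus, contrast, stud = wofe_statistics(40, 9960, 22, 86558)
print(w_plus, w_minus, contrast, stud)
```

Because the prior odds cancel in the difference of the two weights, the contrast returned here equals the logarithm of the cross-product ratio in formula (10).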

Decision tree

C4.5 was presented by J.R. Quinlan (1986, 1993) and is an extension of ID3. C4.5 uses the gain ratio as an attribute selection measure instead of the gain, which is used in ID3, and adds significant functions compared to ID3, such as attribute discretization, rules generation and uncertainty processing functions.

Suppose that S is a set of cases, Freq($C_i$, S) represents the number of cases in S that belong to class $C_i$, and |S| stands for the total number of cases in set S. If we randomly select a case from a set of cases S and this case belongs to class $C_i$, then the a priori probability of the event is as follows:

$$P_i = \frac{\mathrm{Freq}(C_i, S)}{|S|} \tag{11}$$

The information that the above description contains is

$$-\log_2(P_i) \tag{12}$$

The expected information classifying the set of S cases into m classes is

$$\mathrm{Info}(S) = -\sum_{i=1}^{m} P_i \log_2(P_i) \tag{13}$$

Info(S) is also called the entropy of set S. If we partition the cases in S on attribute A (with A having v distinct values) into subsets $S_1, \ldots, S_v$, then the expected information is expressed as follows:

$$\mathrm{Info}_A(S) = \sum_{j=1}^{v} \frac{|S_j|}{|S|} \times \mathrm{Info}(S_j) \tag{14}$$

The gain is defined by the following:

$$\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}_A(S) \tag{15}$$

Gain is employed in ID3; however, it is biased toward attributes having many distinct values. The gain ratio was adopted in C4.5 to overcome the bias. The split information is defined by the following:

$$\mathrm{SplitInfo}_A(S) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \times \log_2\!\left(\frac{|S_j|}{|S|}\right) \tag{16}$$

The gain ratio was calculated based on the following equation:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(S)} \tag{17}$$
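To make formulas (13) to (17) concrete, the following Python sketch computes the gain ratio of a single discrete attribute. The function names and the toy data are hypothetical, and returning 0 for an attribute with only one value is a convention chosen here, not something stated in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(S), Eq. (13): entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total)
                for k in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A), Eq. (17), for one discrete attribute A.

    values[i] is the value of A for case i; labels[i] is the class of case i.
    """
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Eq. (14): expected information after partitioning on A.
    info_a = sum(len(s) / total * entropy(s) for s in subsets.values())
    # Eq. (15): information gain.
    gain = entropy(labels) - info_a
    # Eq. (16): split information of the partition itself.
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets.values())
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0

# Toy data: both attributes separate the classes perfectly (gain = 1 bit),
# but the one with four distinct values is penalized by its split information.
labels = [1, 1, 0, 0]
print(gain_ratio(["near", "near", "far", "far"], labels))
print(gain_ratio(["a", "b", "c", "d"], labels))
```

The second call illustrates the bias that the gain ratio is meant to correct: an attribute that gives every case its own value achieves the same gain but a lower gain ratio.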


Probability-based ranking and smoothing

C4.5 is a supervised classification model that requires modification to make it useful for prediction, that is, to turn a decision tree into a probabilistic estimation tree. The leaf nodes of a decision tree represent the classification result, namely the category to which a case belongs. However, different leaf nodes may share the same category. We can learn that a case belongs to some category, but not how likely it is to belong to it. Commonly, the probability-based ranking method is employed to rank the likelihood that a case belongs to a certain category. The method is implemented by ranking the absolute class frequency of each node of the tree. Assume that a leaf node contains 100 cases, a subset of all cases, of which 90 belong to the positive class and 10 to the negative class. Then, we can estimate that if a case is allocated to this leaf node, there is a 90 % likelihood that this case will be of the positive class and a 10 % likelihood that it will be of the negative class. Nevertheless, the above method has problems when only a limited number of cases are allocated to a leaf, because the raw frequencies are then unreliable. One solution to this problem is to “smooth” the probability estimate. Different smoothing methods have been proposed, including Laplace Correction, m-estimation and m-branch (Zadrozny and Elkan 2001; Ferri et al. 2002). Thus, ranking based on the estimated class-membership probability, combined with probability smoothing, is applied to transform a classifier derived from C4.5 into a predictor. Prior work (Provost and Domingos 2003) found that a C4.5 induction learner without pruning and without node “collapsing” (Quinlan 1993) can achieve the best prediction accuracy.

Among the three smoothing methods mentioned previously, m-branch was adopted in this study. Unlike Laplace and m-estimates, which require a uniform class distribution assumption, the m-branch can be used in class-imbalanced datasets. A dataset is imbalanced if the classes are not approximately equally represented. The m-branch also takes the class distributions of non-leaf nodes into consideration. The m-branch is a recursive root-to-leaf extension of the m-estimate. Suppose that a path from the root to a leaf is expressed as $\langle v_1, v_2, \ldots, v_d \rangle$, where $v_1$ is the root node and $v_d$ is the leaf node. Then, the smoothed probability of a node j is codetermined by the probability of its parent node and the class distribution of the node itself. Let $p_i^0 = 1/c$, where c is the number of classes; the recursive expression for computing the smoothed probability is as follows:

$$p_i^{j} = \frac{n_i^{j} + m \cdot p_i^{j-1}}{\left(\sum_{k=1}^{c} n_k^{j}\right) + m} \tag{18}$$

where $n_i^j$ is the number of cases in node $v_j$ that belong to class i. The height of a node is defined as h = d + 1 − j, where d is the depth of the branch to which the node belongs and j is the depth of the node. The normalized height of a node is defined as $\Delta = 1 - \frac{1}{h}$. Then, m is calculated as follows:

$$m = M\left(1 + \Delta \cdot \sqrt{n}\right), \tag{19}$$

where M is a constant. The value of m at the leaves is $M$, at the parent node it is $M + \frac{1}{2} \cdot M \cdot \sqrt{n}$, at the upper parent node it is $M + \frac{2}{3} \cdot M \cdot \sqrt{n}$, and at the root it is $M + \frac{d-1}{d} \cdot M \cdot \sqrt{n}$.

Experiments

A prospectivity map of the eastern Kunlun Mountains in China for Skarn iron mineralisation was developed using the proposed methodology to test its suitability for prediction in mineral exploration. Eight evidential maps were created and 62 known skarn iron deposits were used as training data for the spatial analysis. The weights-of-evidence technique was used to rank the importance of each evidential map and determine the optimal buffer distance for the NW and EW fault line, fault intersections, Indosinian k-feldspar granite and Tanjianshan group layers. A goodness-of-fit test and ten-fold cross-validation were employed to train and test the geo-datasets. Finally, the iron prospectivity map was delineated by partitioning the study area into a low-potential area, a moderate-potential area and a high-potential area. To validate the feasibility of the method in this paper, the traditional weights-of-evidence method was used as a comparison, and Receiver Operating Characteristic (ROC) curves and the Area Under ROC Curve (AUC) were used to evaluate the predictive accuracy of the final predictive map.

Data processing

According to the geology characteristics of the study area and fundamental mineralization principles, evidential maps related to iron mineralization were developed, including NW and EW fault line, fault intersections, Indosinian k-feldspar granite, Tanjianshan group, aeromagnetic and gravitational data, and Fe and Mn anomalies. The evidential maps were extracted from geological paper maps. Among them, NW and EW fault line, fault intersections, Indosinian k-feldspar granite and Tanjianshan group were derived from the geologic maps provided by the Qinghai Institute of Geological Survey, China. The maps are at a 1:500,000 scale. Aeromagnetic data, Bouguer gravity data and geochemical anomalies data are also available from the Institute in map format at a 1:500,000 scale.

Our study makes use of equally sized square unit cells, called grids hereafter, to divide the 95,321 km² study area into 96,580 grids. The selection of the grid size is determined based on data sources such as geological maps. The maps are at a 1:500,000 scale, that is, 2 mm on the maps represents 1,000 m on the ground. With the choice of a ground cell size of 1,000×1,000 m, or a 2×2 mm area on maps, the chance for possible locational error is minimized when evidential maps such as the fault line, granite, and Tanjianshan group are extracted.
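The scale conversion behind this cell size can be written out explicitly, as a worked check of the figures quoted above:

$$2\ \text{mm} \times 500{,}000 = 1{,}000{,}000\ \text{mm} = 1{,}000\ \text{m}, \qquad 1{,}000\ \text{m} \times 1{,}000\ \text{m} = 1\ \text{km}^2$$

Each cell therefore covers 1 km² on the ground, which is consistent in magnitude with dividing a 95,321 km² study area into 96,580 cells.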

Some evidence factors have a radial influence on mineralization. For example, fault lines influence mineralization within a certain range around them, not only in the cells that the lines cross. On the basis of the spatial proximity theory, the influence of the evidence factors on an iron deposit decreases as the distance from the evidence factors increases. In contrast, as the buffer size increases, the chance of including a known iron deposit site increases because of the increase in spatial extent. Thus, there is an optimal distance at which the interplay of the decline in influence and the increase in the number of sites reaches an equilibrium. The behavior of the interplay between the known deposits and the evidence factor can be quantified by the variation of the Stud(C) values, as described in the section “Contrast coefficient of weights-of-evidence”. Thus, we chose as the optimal distances those that yield the maximal Stud(C).

Assume that d is the variable of the buffer distance, where d ∈ [a, b], Stud(d*) is the Stud(C) value given the condition d = d*, and D is the optimal buffer distance. Then, the process for finding D can be expressed as follows:

$$D = \underset{d \in [a,\, b]}{\arg\max}\ \mathrm{Stud}(d) \tag{20}$$

We carried out the optimal buffer distance analysis on the NW and EW fault line, fault intersections, Indosinian k-feldspar granite and Tanjianshan group. The distance was initialized at 500 m. Then, the analysis was run repeatedly, increasing the distance in steps of 500 m until it reached 7,500 m. For the NW and EW fault line, the value of Stud(C) reached its peak of 1.5915 when the buffer distance was 500 m. For the fault intersections, the value of Stud(C) reached its peak of 1.4864 when the buffer distance was 4,500 m. For the Indosinian k-feldspar granite, the value of Stud(C) reached its peak of 6.9406 when the buffer distance was 2,000 m. For the Tanjianshan group, the value of Stud(C) reached its peak of 8.3882 when the buffer distance was 2,000 m. Related statistics are shown in Table 1.
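The sweep described here amounts to evaluating Stud(C) on a grid of candidate distances and keeping the maximizer, as in formula (20). A minimal Python sketch follows; the callable `stud_of_distance` is hypothetical and stands in for whatever GIS step buffers the map and computes Stud(C), and the toy curve is invented for illustration.

```python
def optimal_buffer_distance(stud_of_distance, start=500, stop=7500, step=500):
    """Return (D, Stud(D)) maximizing Stud(C) over the candidate distances.

    stud_of_distance : hypothetical callable mapping a buffer distance in
                       metres to the Stud(C) value of the buffered map
    """
    candidates = range(start, stop + step, step)   # 500, 1000, ..., 7500
    best = max(candidates, key=stud_of_distance)
    return best, stud_of_distance(best)

# Toy stand-in curve that peaks at 2,000 m (not real data).
toy_curve = lambda d: -abs(d - 2000)
print(optimal_buffer_distance(toy_curve))
```

If several distances tie for the maximum, `max` returns the smallest of them because the candidates are scanned in ascending order.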

In practice, many other preprocessing steps are indispensable, such as dealing with missing values, interpolation, and discretization. All of these preprocessing operations were implemented using ArcGIS 9.3 software. Finally, every evidential map was integrated to derive input feature vectors, which are the input data of our C4.5 model.

Experimental procedure and results

We followed the process of mineral prospectivity mapping as shown in Fig. 2. Initially, the evidential maps were compiled from the various data sources. Then, a spatial analysis was performed to establish the optimal buffer distances around the various geological features. All the evidential maps were then aggregated with the known iron deposit map into a single map. The input feature vectors were derived from this map. Next, using the feature vectors as training data, we built decision trees with the C4.5 model. To transform a DT to a PET, the m-branch smoothing method was employed to assign a predictive probability to each leaf node. As shown in Fig. 2, the leaf nodes were painted with different colors, which denote different probabilities. Finally, test data were imported into the constructed PET. The predictive probability of each test case, corresponding to the probability of each cell in a map, could be computed. The iron prospectivity map was finally developed using the predictive probabilities.

As mentioned in the section “Probability-based ranking and smoothing”, a fully grown decision tree, without pruning and without node “collapsing”, gives the best predictive accuracy. Pruning and node “collapsing” were therefore turned off. In addition, to transform a DT to a PET, an m-branch function module must be added to the original C4.5 algorithm. Referring to the open source code of C4.5 released under the GPL, we programmed a version of C4.5 in C#. Our version simplified the original C4.5 program to satisfy only our requirements. The AUC assessment criterion was added to our program as well.

Table 1 Optimal buffer distances of maps and Stud(C)

| Evidential map | Optimal buffer distance | C | Stud(C) |
|---|---|---|---|
| NW and EW fault line | 500 m | 0.6032 | 1.5915 |
| Fault intersections | 4,500 m | 1.4987 | 1.4864 |
| Indosinian k-feldspar granite | 2,000 m | 1.8431 | 6.9406 |
| Tanjianshan group | 2,000 m | 2.2826 | 8.3882 |

For the evaluation of model performance, two different experimental procedures were used. We carried out the goodness-of-fit test using the set of known iron deposits and the ten-fold cross-validation for testing the predictive accuracy of the decision tree. In addition, using identical experimental data, a traditional mineral prospectivity mapping model, weights of evidence, was also developed for comparison with the model constructed using the proposed methodology. To assess the performance of the predictive model, the widely used ROC curves were employed (Han et al. 2006). Compared to the misclassification rate (or error rate), the ROC curve is generally considered a better assessment of classifiers. Moreover, it performs excellently in imbalanced datasets. In addition, it is an intuitive visual tool for comparing two-class classifiers (Ferri et al. 2002, 2003). Considering the imbalanced dataset of our experiment, use of the ROC curve is suitable. When comparing the performances of classifiers, the closer the ROC curve of a model is to the top left corner, the more accurate the model is (and the farther away, the less accurate). The AUC can provide a quantitative evaluation of the models. This paper adopted a straightforward way to estimate the AUC (Hand and Till 2001).
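One common way to estimate the AUC for two classes without drawing the curve is the rank-sum form: rank all cases by predicted probability, sum the ranks of the positive cases, and normalize. The Python sketch below assumes this is the kind of estimate meant here; the function name is hypothetical, and averaging the ranks of tied scores is a choice made for this sketch.

```python
def auc_from_ranks(scores, labels):
    """AUC via the rank-sum formula for two classes.

    scores : predicted probabilities of the positive class
    labels : 1 for a positive case (deposit cell), 0 for a negative case
    Tied scores receive the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0      # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Toy scores (not real data): one negative case outranks one positive case.
print(auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The result can be read as the probability that a randomly chosen deposit cell receives a higher score than a randomly chosen non-deposit cell, which is why it is insensitive to how rare the deposit cells are.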

For the goodness-of-fit test, the feature vectors derived from 96,580 cells (62 cells containing iron deposits) were used to train a decision tree and evaluate the constructed model. A decision tree with 865 leaves and a maximal depth of nine was constructed. The misclassification rate of the decision tree was 0.1 %. For the ten-fold cross-validation, the dataset was randomly grouped into ten subsets, each of approximately equal size and approximately equal distributions of classes (cells containing mineral deposits and cells not containing mineral deposits). At every iteration, nine of the ten subsets were used to collectively train the model, and the remaining subset was used as test data. In the next iteration, the test subset was changed and the remaining subsets were used to train the model. A total of ten iterations were performed, making sure that every subset had been used as test data (Fig. 3). In total, ten decision trees were constructed.
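The grouping described above is a stratified ten-fold split. A minimal Python sketch is given below; the function names, the fixed random seed and the round-robin dealing are choices made for this illustration and are not taken from the authors' C# program.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign each case to one of k folds, keeping class proportions similar.

    labels : class label of each case (e.g. 1 = deposit cell, 0 = other cell)
    Returns a list fold_of with fold_of[i] in 0..k-1.
    """
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled cases of this class round-robin across the folds.
        for position, i in enumerate(indices):
            fold_of[i] = position % k
    return fold_of

def train_test_splits(fold_of, k=10):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    for fold in range(k):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test

# Small hypothetical dataset: 62 positive cases among 1,000.
labels = [1] * 62 + [0] * 938
fold_of = stratified_folds(labels)
for train, test in train_test_splits(fold_of):
    print(len(train), len(test), sum(labels[i] for i in test))
```

With 62 positive cases and ten folds, each test subset receives six or seven deposit cells, so every iteration is evaluated on at least a few positives despite the strong class imbalance.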

Fig. 2 The flow diagram of experiments

Fig. 3 The flow diagram of ten-fold cross-validation

In the course of transforming a DT into a PET, formula (19) has a constant M. We selected the value of M by analyzing the accuracy of the model. That is, the value of M with which the model obtained the best accuracy was chosen as the parameter of the m-branch. For the goodness-of-fit test, the AUC reached 97.07 % when M=1. The corresponding ROC curve is depicted in Fig. 4. For the ten-fold cross-validation, the AUC was 95.10 % when M=10. The corresponding ROC curve is shown in Fig. 4. The variations of AUC with M are shown in Fig. 5. In Fig. 4, the curve of the goodness-of-fit test is closer to the top left corner than the curves of the WofE and the ten-fold cross-validation. Accordingly, the goodness-of-fit test returned the best predictive accuracy, as anticipated. Additionally, the model constructed using the proposed methodology returned better predictive accuracy than that constructed using WofE, for which the AUC was 80.84 %.

After importing test data into the PET, each cell within the study area was associated with a value indicating the occurrence probability of an iron deposit in the cell. The larger the value is, the higher the chance of iron occurrence is. In the spatial presentation of the probability values as a synoptic map, low-, moderate- and high-potential areas are typically used. Empirically, the low-potential area should be at least 70 % of the entire area, the moderate-potential area should be ∼20 %, and the high-potential area should be 10 % or less (He et al. 2010a, b, 2012). Referring to our previous work in the eastern Kunlun Mountains, proportions of 75 % for the low-potential area, 15 % for the moderate-potential area, and 10 % for the high-potential area were employed to partition the study area. To group the cells into three categories, two dividing points are needed. In practice, all the probabilities of the test grids were sorted in ascending order, and the values at the 75 % and 90 % positions were adopted as the cut-off points. For the goodness-of-fit test, probability values of 0.00001 and 0.00088 were adopted. Thus, the values for the divisions were ≤0.00001 for the low-potential area, 0.00001∼0.00088 for the moderate-potential area, and >0.00088 for the high-potential area. For the ten-fold cross-validation, two cut-off points were adopted, 0.00327 and 0.00937. Thus, the values for the divisions were ≤0.00327 for the low-potential area, 0.00327∼0.00937 for the moderate-potential area, and >0.00937 for the high-potential area. For the WofE, two cut-off points were adopted, 0.000241 and 0.000441. Then, the mineral prospectivity maps for the goodness-of-fit test (Fig. 6), the ten-fold cross-validation (Fig. 7) and WofE (Fig. 8) were made with the three potential areas. As shown in the maps, the majority of the iron deposits were within the high-potential area, which covered a relatively small area.
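The choice of the two cut-off points can be sketched as follows. This is a minimal illustration, assuming that the cut-offs are read from the sorted probabilities at the 75 % and 90 % positions; the function names and the 100 toy probabilities are hypothetical. When many cells share the same probability, as happens when cells fall into the same leaf, the resulting area shares are only approximately 75 %, 15 % and 10 %.

```python
def cut_off_points(probabilities, low_pct=75, high_pct=90):
    """Return the probabilities found at the low_pct and high_pct positions."""
    ordered = sorted(probabilities)
    n = len(ordered)
    low_cut = ordered[max(n * low_pct // 100 - 1, 0)]
    high_cut = ordered[max(n * high_pct // 100 - 1, 0)]
    return low_cut, high_cut


def potential_class(p, low_cut, high_cut):
    """Assign a cell to the low-, moderate- or high-potential area."""
    if p <= low_cut:
        return "low"
    if p <= high_cut:
        return "moderate"
    return "high"


# Hypothetical probabilities for 100 cells, all distinct.
cell_probabilities = [i / 1000 for i in range(1, 101)]
low_cut, high_cut = cut_off_points(cell_probabilities)
classes = [potential_class(p, low_cut, high_cut) for p in cell_probabilities]
shares = {name: classes.count(name) for name in ("low", "moderate", "high")}
```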
Fig. 4 ROC curves of models

Fig. 5 AUC variation with M

According to the statistics, for the goodness-of-fit test, 91.94 % of the total 62 iron deposits were within the high-potential area, 8.06 % were within the moderate-potential area and 0 % were within the low-potential area. For the ten-fold cross-validation, 82.26 % were within the high-potential area, 14.52 % were within the moderate-potential area and 3.22 % were within the low-potential area. For the WofE, 51.61 % of the total 62 iron deposits were within the high-potential area, 24.19 % were within the moderate-potential area and 24.20 % were within the low-potential area. This comparison shows that our methodology achieved better performance, with more iron deposits falling within the high- and moderate-potential areas and fewer within the low-potential area.
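The percentages above are simple shares of the 62 known deposits. As a brief check, the sketch below computes such shares from a list of per-deposit classes; the counts 57 and 5 used for the goodness-of-fit test are those implied by 91.94 % and 8.06 % of 62 and are not quoted in the text, and the function name `capture_rates` is hypothetical.

```python
def capture_rates(deposit_classes):
    """Return the percentage of deposits falling within each potential area."""
    total = len(deposit_classes)
    return {
        name: round(100 * deposit_classes.count(name) / total, 2)
        for name in ("high", "moderate", "low")
    }


# Counts implied by the reported goodness-of-fit percentages (an inference).
goodness_of_fit = ["high"] * 57 + ["moderate"] * 5
rates = capture_rates(goodness_of_fit)
```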

Discussion

The construction of decision tree classifiers does not require any domain knowledge or parameter setting and is therefore appropriate for exploratory knowledge discovery (Han et al. 2006).

Fig. 6 An iron prospectivity map for the goodness-of-fit test

Fig. 7 An iron prospectivity map for ten-fold cross-validation


Compared with other statistical methods, decision trees are not restricted by statistical assumptions. For example, the weights-of-evidence method makes a theoretical assumption of conditional independence. Only if this assumption is satisfied can the weight of a feature be added into the model. For the same study area and dataset, granite and lithology layers cannot both be added to the weights-of-evidence model because the strong association between them will not pass the conditional independence validation. However, geological expertise indicates that the occurrence of iron has a strong relation to granite and other lithologies. The decision tree method can easily solve this problem: without the restriction of conditional independence, more evidential maps can be added into the modeling predictor.

In exploration geology, many factors and phenomena are correlated, and exploration models and predictive maps depend on these interrelationships for data collection and exploration targeting. The constructed decision trees build a mapping between a phenomenon and its related factors, which is intuitive and easily comprehensible. If-then rules, which are similar to the association rules generated by the Apriori algorithm, can be extracted from decision trees.

Fig. 8 An iron prospectivity map for WofE

Fig. 9 Partial decision tree 1

The large number of attributes and their distinct values used in the experiments resulted in substantial decision trees, which for convenience are shown as partial decision trees in Figs. 9 and 10. Within Figs. 9 and 10, the occurrence of iron and the lack thereof are the classification results of the decision trees. The mark (A/B) following a leaf node denotes that the leaf contains a total of A cases, B of which are misclassified. Thus, a rule can be extracted from the left branch of the decision tree in Fig. 10 as follows:

If Tanjianshan group = True, K-feldspar granite = True, Aeromagnetic anomaly = 0∼50, then iron occurrence = True.

This means that if a case satisfies the if-condition, the case will be classified as the positive class, or iron occurrence, and the probability of iron occurrence is approximately 50.4 %. The probability of an iron occurrence was also calculated by m-branch smoothing. Likewise, a rule is extracted from the right branch as follows:

If Tanjianshan group = True, K-feldspar granite = True, Aeromagnetic anomaly = −100∼50, NW and EW fault line = True, fault intersections = False, then no deposit occurrence.
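Rules of this kind correspond to root-to-leaf paths of the tree. The sketch below shows one way such rules could be enumerated, assuming the tree is stored as nested dictionaries; the data structure, the function names, and the leaf counts of the toy tree are hypothetical and only borrow attribute names from the case study.

```python
def extract_rules(node, conditions=()):
    """Yield (conditions, label, cases, errors) for every leaf under node."""
    if "label" in node:  # a leaf node
        yield list(conditions), node["label"], node["cases"], node["errors"]
        return
    for value, child in node["branches"].items():
        yield from extract_rules(child, conditions + ((node["attribute"], value),))


def format_rule(conditions, label, cases, errors):
    """Write one rule as text, followed by the (A/B) mark of its leaf."""
    body = ", ".join(f"{attribute} = {value}" for attribute, value in conditions)
    return f"If {body}, then {label} ({cases}/{errors})"


# Hypothetical toy tree; the counts are invented for illustration.
toy_tree = {
    "attribute": "Tanjianshan group",
    "branches": {
        True: {
            "attribute": "K-feldspar granite",
            "branches": {
                True: {"label": "iron occurrence", "cases": 4, "errors": 1},
                False: {"label": "no deposit", "cases": 90, "errors": 0},
            },
        },
        False: {"label": "no deposit", "cases": 500, "errors": 2},
    },
}

rules = [format_rule(*rule) for rule in extract_rules(toy_tree)]
```

Each leaf yields exactly one rule, so a tree with 865 leaves would yield 865 rules, which is why only partial trees are shown in the figures.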

One of the flaws in the mineral prospectivity mapping method introduced in this paper lies in determining the value of the constant M when using m-branch smoothing to transform a DTC into a PET. To obtain better accuracy, the value of M must be determined through repeated experiments. After m-branch smoothing, we obtained relative probabilities. We call the predictive probabilities relative probabilities because they are not the same as Bayesian probabilities: they do not strictly represent the chances of event occurrences. However, they have the same ranking order as Bayesian probabilities. For example, for the partial decision tree in Fig. 8, the leaf node of the left branch contains 76 cases of no mineral deposits of a total of 76 cases, the middle one contains 551 of 551, and the right one contains 670 of 670. The more cases of no mineral deposits there are within a leaf, the lower the predictive probability assigned by the m-branch is, which satisfies our requirements for mineral prospectivity mapping.
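The ranking behaviour described above can be illustrated with a simplified smoothing scheme. The sketch below is not formula (19): it assumes a plain m-estimate with a constant m that is propagated from the root of a branch to its leaf, whereas the paper's formula may define m differently (for example as a function of M and node depth). The function name, the intermediate node counts, and the value m=10 are hypothetical.

```python
def branch_smoothed_probability(path_counts, m, n_classes=2):
    """Propagate an m-estimate from the root of a branch down to its leaf.

    path_counts: (deposit_cases, total_cases) for each node from root to leaf.
    m: smoothing strength, ASSUMED constant here (a simplification).
    """
    p = 1.0 / n_classes  # uninformative prior above the root
    for deposit_cases, total_cases in path_counts:
        # The estimate of the parent acts as the prior of the child.
        p = (deposit_cases + m * p) / (total_cases + m)
    return p


# The root counts come from the text (62 deposits in 96,580 cells); the
# intermediate node is hypothetical. The leaf sizes 76, 551 and 670 are the
# pure no-deposit leaves mentioned above.
shared_path = [(62, 96580), (3, 1300)]
leaf_probabilities = {
    size: branch_smoothed_probability(shared_path + [(0, size)], m=10)
    for size in (76, 551, 670)
}
```

Under this simplification, three leaves that would all receive a raw frequency of zero obtain distinct, strictly positive values, and the leaf with the most no-deposit cases receives the lowest value.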

Both decision trees, without pruning and collapsing, and m-branch smoothing can be applied to clustered datasets. Therefore, this two-step method is very suitable for mineral prospectivity mapping. The high efficiency and scalability of the C4.5 decision tree algorithm were exemplified in this experiment. For the 96,580 cells with eight attributes and a total of 70 distinct discrete values of attributes, only 0.737 s was needed to run a goodness-of-fit test on an ordinary PC (2.66 GHz Intel Core Duo CPU, 2 GB memory).

Conclusion

A novel method was proposed for mapping mineral prospectivity, integrating weights-of-evidence, C4.5 decision tree and m-branch smoothing techniques. The weights-of-evidence model was used as a tool for optimal buffer distance analysis. The two-step model performed well in mapping iron prospectivity in the study area. The majority of the iron deposits were within high-potential and moderate-potential areas, which constituted a small proportion of the study area. Compared with statistical methods, decision trees are not restricted by statistical assumptions. The high efficiency and scalability of decision trees were demonstrated in this case study. The fact that decision trees are intuitive and easily comprehensible is another advantage compared to models using techniques such as neural networks. Such a method can help distinguish prospective areas and reduce the risk of mineral exploration. The method is very suitable for clustered datasets, such as those used in mineral prospectivity mapping. Various spatial prediction problems similar to mineral prospectivity mapping can also be solved with decision trees. A wide range of prospective applications of decision trees can be expected.

Fig. 10 Partial decision tree 1

Acknowledgments This study was supported by grants to the University of Electronic Science and Technology of China from the National Natural Science Foundation of China (Contract #41171302), the Program for New Century Excellent Talents in University (Contract #NCET-12-0096) and the National High-Tech Research and Development Program of China (Contract #2007AA12Z227). The authors thank Mr. Yongcheng Zhuang of the Qinghai Institute of Geological Survey, China, for his suggestions and assistance in fieldwork.

References

Abedi M, Norouzi GH, Bahroudi A (2012) Support vector machine for multi-classification of mineral prospectivity areas. Comput Geosci 46(1):272–283

Agterberg FP (1992) Combining indicator patterns in weights of evidence modeling for resource evaluation. Nat Resour Res 1(1):39–50

Agterberg FP, Bonham-Carter GF, Cheng QM, Wright DF (1993) Weights of evidence model and weighted logistic regression in mineral potential mapping. In: Davis JC et al (eds) Computers in geology. Oxford University Press, New York, pp 13–32

Aminzadeh F (1994) Applications of fuzzy expert systems in integrated oil exploration. Comput Electr Eng 20(2):89–97

Bonham-Carter GF, Agterberg FP, Wright DF (1989) Weights of evidence modeling: a new approach to mapping mineral potential. Stat Appl Earth Sci 89(9):171–183

Carranza EJM, Hale M (2001) Logistic regression for geologically constrained mapping of gold potential, Baguio District, Philippines. Explor Min Geol 10(3):165–175

Carranza EJM (2004) Weights of evidence modeling of mineral potential: a case study using small number of prospects, Abra, Philippines. Nat Resour Res 13(3):173–187

Cui Y, He BB, Chen JH, He ZH, Liu Y (2010) Mining metallogenic association rules combining cloud model with apriori algorithm. Proceedings of IEEE International Geoscience and Remote Sensing Symposium, pp 4507–4510

Daneshfar B, Desrochers A, Budkewitsch P (2006) Mineral-potential mapping for MVT deposits with limited data sets using Landsat data and geological evidence in the Borden basin, northern Baffin island, Nunavut, Canada. Nat Resour Res 15(3):129–149

Ferri C, Flach PA, Hernández-Orallo J (2002) Learning decision trees using the area under the ROC curve. Proceedings of International Conference on Machine Learning, pp 139–146

Ferri C, Flach PA, Hernández-Orallo J (2003) Improving the AUC of probabilistic estimation trees. Proceedings of the 14th European Conference on Machine Learning, pp 121–132

Gumiel P, Sanderson DJ, Arias M, Roberts S, Martín-Izard A (2010) Analysis of the fractal clustering of ore deposits in the Spanish Iberian Pyrite Belt. Ore Geol Rev 38(4):307–318

Han J, Kamber M, Pei J (2006) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco

Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186

Harris JR, Wilkinson L, Grunsky EC (2000) Effective use and interpretation of lithogeochemical data in regional mineral exploration programs: application of Geographic Information Systems (GIS) technology. Ore Geol Rev 16(3):107–143

He BB, Chen CH, Liu Y (2010a) Gold resources potential assessment in eastern Kunlun Mountains of China combining weights-of-evidence model with GIS spatial analysis technique. Chin Geogr Sci 20(5):461–470

He BB, Chen CH, Liu Y (2010b) Mineral potential mapping for Cu-Pb-Zn deposits in the East Kunlun Region, Qinghai Province, China, integrating multi-source geology spatial data sets and extended weights-of-evidence modeling. GISci Remote Sens 47(4):514–540

He BB, Cui Y, Chen JH (2011) A spatial data mining method for mineral resources potential assessment. IEEE The First International Conference on Spatial Data Mining and Geographical Knowledge Services, pp 96–99

He BB, Chen JH, Chen CH, Liu Y (2012) Mineral prospectivity mapping method integrating multi-sources geology spatial data sets and case-based reasoning. J Geogr Inf Syst 4(2):77–85

Hwang SG, Guevarra IF, Yu BO (2009) Slope failure prediction using a decision tree: a case of engineered slopes in South Korea. Eng Geol 104(1–2):126–134

Koike K, Matsuda S, Suzuki T, Ohmi M (2002) Neural network-based estimation of principal metal contents in the Hokuroku District, Northern Japan, for exploring kuroko-type deposits. Nat Resour Res 11(2):135–156

Luo X, Dimitrakopoulos R (2003) Data-driven fuzzy analysis in quantitative mineral resource assessment. Comput Geosci 29(1):3–13

Porwal AK, Kreuzer OP (2010) Introduction to the special issue: mineral prospectivity analysis and quantitative resource estimation. Ore Geol Rev 38(3):121–127

Porwal A, González-Álvarez I, Markwitz V, McCuaig TC, Mamuse A (2010) Weights-of-evidence and logistic regression modeling of magmatic nickel sulfide prospectivity in the Yilgarn Craton, Western Australia. Ore Geol Rev 38(3):184–196

Provost F, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3):199–215

Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106

Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco

Rigol-Sanchez JP, Chica-Olmo M, Abarca-Hernandez F (2003) Artificial neural networks as a tool for mineral potential mapping with GIS. Int J Remote Sens 24(5):1151–1156

Wu X, Kumar V, Ross Quinlan J et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

Yeon YK, Han YG, Ryu KH (2010) Landslide susceptibility mapping in Injae, Korea, using a decision tree. Eng Geol 116(3–4):274–283

Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp 204–213

Zuo RG, Carranza EJM (2011) Support vector machine: a tool for mapping mineral prospectivity. Comput Geosci 37(12):1967–1975
