
Soil Classification System from Cone Penetration Test Data Applying Distance-Based Machine Learning Algorithms

L.O. Carvalho, D.B. Ribeiro

Abstract. Most work in the literature dedicated to soil classification systems from cone penetration test (CPT) data is based on simple two-dimensional charts. One alternative approach is using machine learning (ML) to produce new soil classification systems or to reproduce existing ones. The available studies within this research field can be considered limited, since most of them do not include more than two inputs within their analysis and are applicable only to specific regions. In this context, the aim of this work is to use distance-based ML techniques to replicate two chart-based methods from the literature. Combinations of up to five input features are tested, with the objective of discussing geotechnical aspects of soil classification systems. Results are compared using the statistical test of Friedman with the post-hoc statistics of Nemenyi and the signed-rank statistical test of Wilcoxon. The dataset used can be considered diversified because it contains 111 CPT soundings from several countries. Results show that the ML techniques used maintain reasonable accuracy when inputs are substituted and when incomplete data are used, which can lead to cost reduction in real engineering projects. It is important to notice that these observations would not be possible by using the replicated soil classification systems alone.

Keywords: cone penetration test, distance-based algorithms, machine learning, soil classification system.

1. Introduction

Most available systems for soil classification from CPT data use two-dimensional charts divided into regions which represent different soil types. Initially, these charts were based on soil type (grain size and plasticity) and used raw CPT data, like cone resistance and lateral friction (Begemann, 1965). Nonetheless, later studies produced better classification methods by focusing on soil behavior and by proposing normalizations for CPT data (Douglas, 1981; Robertson et al., 1986). Some popular classification methods make use of two charts instead of one, combining three normalized variables in pairs. In this context, some works propose normalizations to include the influence of depth and overburden (Robertson, 1990). Nevertheless, these methods are not accurate for offshore soils (Jefferies & Davies, 1991) due to the dilative behavior of highly overconsolidated clays commonly found in deep water soils (Robertson, 1991). This limitation is supported by experimental data (Jefferies & Davies, 1991; Ramsey, 2002). Thus, these methods fail to distinguish stiff or dense granular soils from overconsolidated clay (Schneider et al., 2008). Different normalizations and charts were proposed to address this problem (Schneider et al., 2008; Schneider et al., 2012). Nonetheless, Robertson (2016) states that soil classification systems that use charts may not be reliable for structured soils, meaning aged or cemented soils, like some offshore soils. He also recommends considering soils structured if a modified normalized small-strain rigidity index K*G is above 330, although some geotechnical judgment is required.

Another possible approach for soil classification systems from CPT data is based on the use of statistics and ML techniques. Most authors interested in solving general geotechnical problems use artificial neural networks (ANN) to predict values of interest such as soil parameters (Goh, 1995; Goh, 1996; Schaap et al., 1998; Juang & Chen, 1999; Kumar et al., 2000; Juang et al., 2002; Juang et al., 2003; Hanna et al., 2007). Nevertheless, one can find works using support vector machines (SVM) (Goh & Goh, 2007), decision trees (DT) (Livingston et al., 2008) and random forests (RF) (Kohestani et al., 2015). For soil classification systems there are two main approaches: one is replicating existing soil classification systems and the other is trying to propose new ones. Most works in this research field are dedicated to the latter approach, using data clustering (Hegazy & Mayne, 2002; Facciorusso & Uzielli, 2004; Liao & Mayne, 2007; Das & Basudhar, 2009; Rogiers et al., 2017). Among the few works that investigate replicating existing soil classification systems such as Robertson charts (Arel, 2012), the only ML technique tested is usually ANN (Kurup & Griffin, 2006; Reale et al., 2018). Nonetheless, there is a study that compares different ML techniques when replicating existing systems for soil classification (Bhattacharya & Solomatine, 2006), although the used

Soils and Rocks, São Paulo, 42(2): 167-178, May-August, 2019. 167

Lucas Orbolato Carvalho, M.Sc., Departamento de Geotecnia, Divisão de Engenharia Civil, Instituto Tecnológico de Aeronáutica, São José dos Campos, SP, Brazil, e-mail: [email protected] Betioli Ribeiro, Associate Professor, Departamento de Geotecnia, Divisão de Engenharia Civil, Instituto Tecnológico de Aeronáutica, São José dos Campos, SP, Brazil, e-mail: [email protected] on March 12, 2018; Final Acceptance on July 8, 2019; Discussion open until December 31, 2019.
DOI: 10.28927/SR.422167

dataset is restricted to a few CPT soundings, all taken from the same location. Other works related to classifying soil with ML are Bilski & Rabarijoely (2009), Rao et al. (2016) and Chandan & Thakur (2018).

In this work, two chart-based soil classification systems proposed by Robertson (1991) and Robertson (2016) are replicated using distance-based ML techniques. These techniques were chosen among other options for their simplicity and because the literature lacks studies using this type of approach. The objective is to investigate and discuss geotechnical aspects of soil classification systems that cannot be revealed by using the original Robertson methods. First, the stratigraphic profiles of 111 CPT soundings taken in several countries are obtained using a student version of the CPeT-IT v2.0.2.5 software (Ioannides & Robertson, 2016), which employs Robertson charts in a soil classification system. Next, the k-nearest neighbor (KNN) and distance-weighted nearest-neighbor (DWNN) ML techniques are used to replicate the Robertson (1991) and Robertson (2016) charts. For each ML technique and each classification method, 33 input feature combinations are tested and all results are compared using the Friedman statistical test (Friedman, 1937) with the Nemenyi post-hoc statistics (Nemenyi, 1963) and the Wilcoxon statistical test (Wilcoxon, 1945). The proposed discussions produced several original contributions, such as showing that:

1. Distance-based ML techniques are capable of reproducing Robertson soil classification systems with good accuracy;

2. Reasonable accuracy can be obtained without the normalizations proposed in the literature for the CPT data;

3. Including soil age as an input feature contributes to distinguishing between soil classes.

2. Soil Classification Systems

In this work, two soil classification systems available within a student version of the CPeT-IT software are replicated using distance-based ML techniques. The objective of this section is to present the theory that supports each of these methods.

2.1. Influenced by soil type (IST)

The first replicated method is based on the work of Robertson (1991). Although it was conceived as a behavioral classification, the labels assigned to classes are inspired by conventional soil type classes, even showing some compatibility with real soil types (Kurup & Griffin, 2006). For this reason, this method is here considered influenced by soil type, being referred to as IST throughout this text. It adopts nine possible soil types, two of which are said to be heavily overconsolidated or cemented. The IST classes are in Table 1.

The initial inputs used by CPeT-IT to classify soil with the IST method are raw CPT data, namely cone resistance qc (MPa), lateral friction fs (kPa), pore pressure measured behind the cone tip u2 (kPa) and depth z (m). These values are used to obtain the input features originally considered by Robertson (1990), namely normalized cone resistance Qt1 (Eq. 1), normalized friction ratio Fr (Eq. 2) and normalized excess pore pressure Bq (Eq. 3). The cone resistance normalization was later updated to Qtn (Schneider et al., 2008), resulting in the charts presented in Figs. 1a and 1b. Besides the nine classes predicted within these charts, an additional class 0 is used for misclassified soils.

To obtain the normalized values, first the raw cone resistance qc is replaced by the total cone resistance qt, which accounts for the pore pressure assisting cone penetration. The next step is estimating the soil unit weight γ (kN/m3) (Lunne et al., 2002; Mayne et al., 2010; Mayne, 2014), which is used to obtain the total overburden pressure σv0 (kPa) and the effective overburden pressure σ'v0 (kPa). If the water table is not known, it can be estimated by fitting a straight line in the z × u2 chart (Fig. 2) when a drained penetration is observed. The water table depth is then used to compute the equilibrium pore pressure u0, which is used to determine the excess pore pressure u2 - u0.

Given these estimations, the following normalizations are obtained:

$$Q_{t1} = \frac{q_t - \sigma_{v0}}{\sigma'_{v0}} \tag{1}$$

$$F_r = \frac{f_s}{q_t - \sigma_{v0}} \tag{2}$$

$$B_q = \frac{u_2 - u_0}{q_t - \sigma_{v0}} \tag{3}$$

Nevertheless, works from the literature state that the exponent n of σ'v0 (n = 1 in Eq. 1) should vary from 0.5 for sands to 1 for clays (Zhang et al., 2002). To calculate n, one can consider its correlation with the classification index Ic (Robertson, 2009):

$$I_c = \left[ \left( 3.47 - \log Q_{tn} \right)^2 + \left( \log F_r + 1.22 \right)^2 \right]^{0.5} \tag{4}$$

Table 1 - IST classes.

1) Sensitive, fine grained
2) Organic soils – peats
3) Clays – clay to silty clay
4) Silt mixtures – clayey silt to silty clay
5) Sand mixtures – silty sand to sandy silt
6) Sands – clean sand to silty sand
7) Gravelly sand to sand
8) Very stiff sand to clayey sand
9) Very stiff, fine grained

The normalized cone resistance Qtn is then given by:

$$Q_{tn} = \left( \frac{q_t - \sigma_{v0}}{p_a} \right) \left( \frac{p_a}{\sigma'_{v0}} \right)^{n} \tag{5}$$

and the exponent n can be written as:

$$n = 0.381\, I_c + 0.05 \left( \frac{\sigma'_{v0}}{p_a} \right) - 0.15 \tag{6}$$

where pa = 0.1 MPa is a reference pressure.

The CPeT-IT software uses only the Qtn × Fr chart to generate the soil classification system outputs. Soil is considered misclassified and is labeled with class 0 if the values obtained for Qtn and Fr are not within the ranges presented in this chart.
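Eqs. 4 to 6 depend on each other: Qtn needs n, n needs Ic, and Ic needs Qtn. One possible way to resolve this is a fixed-point iteration, sketched below. This is an illustration and not the CPeT-IT implementation: the function name, the starting value n = 1, the tolerance, the cap n ≤ 1 and the factor 100 that expresses Fr in percent are assumptions introduced here.

```python
import math

P_A = 100.0  # reference pressure p_a = 0.1 MPa, written in kPa


def normalize_cpt(qt, fs, sigma_v0, sigma_v0_eff, tol=1e-4, max_iter=100):
    """Return (Qtn, Fr, Ic, n) for one reading; qt, fs and stresses in kPa.

    Hypothetical helper: fixed-point iteration over Eqs. 4 to 6.
    Requires qt > sigma_v0 and fs > 0 so that the logarithms are defined.
    """
    q_net = qt - sigma_v0                      # net cone resistance
    fr = 100.0 * fs / q_net                    # Eq. 2, in percent (assumed)

    def qtn_and_ic(n):
        qtn = (q_net / P_A) * (P_A / sigma_v0_eff) ** n             # Eq. 5
        ic = math.sqrt((3.47 - math.log10(qtn)) ** 2
                       + (math.log10(fr) + 1.22) ** 2)              # Eq. 4
        return qtn, ic

    n = 1.0                                    # with n = 1, Eq. 5 equals Eq. 1
    for _ in range(max_iter):
        _, ic = qtn_and_ic(n)
        # Eq. 6, capped at 1 (the cap is an assumption of this sketch)
        n_new = min(1.0, 0.381 * ic + 0.05 * sigma_v0_eff / P_A - 0.15)
        converged = abs(n_new - n) < tol
        n = n_new
        if converged:
            break
    qtn, ic = qtn_and_ic(n)                    # final values for the last n
    return qtn, fr, ic, n
```

For a made-up reading with qt = 5000 kPa, fs = 48 kPa, σv0 = 200 kPa and σ'v0 = 100 kPa, `normalize_cpt(5000, 48, 200, 100)` gives Qtn = 48 and Fr = 1%, because σ'v0 = pa makes Eq. 5 independent of n.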

2.2. Focused on soil behavior only (FSB)

The system proposed by Robertson (2016) establishes a fully behavior-oriented soil classification, which is why it is here considered more focused on soil behavior and named FSB throughout this text. The FSB method includes seven classes (Table 2).

One can observe that the three main soil types are sand-like, clay-like and transitional. Each of these soil types is divided into contractive or dilative. A seventh class



Table 2 - FSB classes.

1) CCS: Clay-like – Contractive – Sensitive

2) CC: Clay-like – Contractive

3) CD: Clay-like – Dilative

4) TC: Transitional – Contractive

5) TD: Transitional – Dilative

6) SC: Sand-like – Contractive

7) SD: Sand-like – Dilative

Figure 1 - a) Qtn × Fr chart from Robertson (1991) updated by Robertson (2009). b) Qtn × Bq chart from Robertson (1991) updated by Robertson (2009).

Figure 2 - Excess pore pressure.

is reserved for contractive clays that have high sensitivity to disturbance, which can be related to the friction ratio using the expression St = 7.1/Fr (Robertson, 2009). If sensitivity is greater than 3, which corresponds roughly to Fr < 2%, then the clay is considered sensitive. The upper limit for the normalized cone resistance for sensitive clays is defined as 10 because they are soft.

As for the IST system, qc, fs, u2 and z are the initial inputs used by CPeT-IT to classify soil with the FSB system. Nonetheless, in this case the soil classification system is based on the charts shown in Figs. 3 and 4, which use the normalized cone resistance Qtn, the normalized friction ratio Fr and the normalized excess pore pressure U2 (Schneider et al., 2008) as inputs. The FSB method also includes a class 0 for misclassified soil, which is assigned when Qtn, Fr or U2 are not within the ranges presented in the charts and also when the classes given by the two charts differ.

The excess pore pressure normalization U2 is obtained as:

$$U_2 = \frac{u_2 - u_0}{\sigma'_{v0}} \tag{7}$$

The curves that separate soil classes are inspired by Schneider et al. (2008) and Schneider et al. (2012). The Qtn × Fr chart has nearly circular curves in the IST method, while in Robertson (2016) the curves have hyperbolic shapes as suggested by Schneider et al. (2012). The Qtn × U2 chart was taken from Schneider et al. (2008) with minor changes, containing the classes originally proposed there.

3. Distance-Based Techniques

In this work, distance-based ML techniques are used to replicate the soil classification systems described in Section 2. These ML techniques have the advantage of using an approach similar to the chart-based methods to be replicated, representing soil examples as points in a space composed of the input features. They also rely on the hypothesis that, if two soil examples produce close points, they are similar. One way of measuring the distance between points is with the Euclidean metric. Considering a pair (xi, xj) of objects in a d-dimensional feature space, the distance between them is given by:

$$\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{l=1}^{d} \left( x_i^{l} - x_j^{l} \right)^2} \tag{8}$$

The distance-based ML algorithms used in this work predict the class of an unknown example using a dataset of examples whose classes are known. The simplest strategy is detecting which known example produces the point that is the nearest neighbor of the point representing the unknown example. The unknown example is then assigned the same class as its nearest neighbor (Cover & Hart, 1967).



Figure 3 - Qtn × Fr chart from Robertson (2016).

Figure 4 - Qtn × U2 chart from Robertson (2016).

It is also possible to use an arbitrary number k of nearest neighbors and decide the class of the unknown example by voting, which corresponds to the k-nearest neighbors (KNN) technique. Tests can be performed to calibrate which k leads to the best predictive performance. In this work, only odd values of k are tested, starting from one and increasing k until a decrease in predictive performance is observed.
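The voting rule and the calibration of k described above can be sketched as follows. This is a minimal illustration and not the code released with the paper: the function names, the upper bound `k_max` and the tie-breaking rule (the tied class met first among the nearest neighbors wins) are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance of Eq. 8 between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k training examples closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # Counter keeps insertion order, so ties go to the class seen first,
    # i.e. the tied class with the nearest representative.
    return votes.most_common(1)[0][0]


def calibrate_k(train_x, train_y, val_x, val_y, k_max=15):
    """Try k = 1, 3, 5, ... and stop as soon as validation accuracy drops."""
    best_k, best_acc = 1, -1.0
    for k in range(1, k_max + 1, 2):
        hits = sum(knn_predict(train_x, train_y, q, k) == y
                   for q, y in zip(val_x, val_y))
        acc = hits / len(val_y)
        if acc < best_acc:
            break                      # performance decreased: stop searching
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```

Plain accuracy is used here for the calibration; any other score computed on the validation set could be substituted.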

It is also possible to weight the votes, so that closer neighbors are more valued than farther ones. In this case, the technique is named distance-weighted nearest neighbor (DWNN) (Dudani, 1976). One specific way of defining these weights is Gaussian weighting, which is defined by the following expression (Hechenbichler & Schliep, 2004):

$$w(\mathrm{dist}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} \mathrm{dist}^2} \tag{9}$$

where dist is the distance value. In this work, the KNN and the DWNN with Gaussian weighting are used and compared to replicate the soil classification systems presented in Section 2.
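A weighted vote based on Eq. 9 could look like the sketch below. Applying the weight directly to the Euclidean distance of Eq. 8, without any rescaling of the distances, is an assumption of this illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def gaussian_weight(dist):
    """Gaussian weight of Eq. 9 for a given distance."""
    return math.exp(-0.5 * dist ** 2) / math.sqrt(2.0 * math.pi)


def dwnn_predict(train_x, train_y, query, k=3):
    """Weighted vote: each of the k nearest neighbors adds w(dist) to its class."""
    neighbors = sorted(((math.dist(x, query), y)
                        for x, y in zip(train_x, train_y)),
                       key=lambda pair: pair[0])
    scores = defaultdict(float)
    for dist, label in neighbors[:k]:
        scores[label] += gaussian_weight(dist)
    return max(scores, key=scores.get)   # class with the largest total weight
```

With one class-1 example very close to the query and two class-2 examples farther away, a plain 3-neighbor vote returns class 2, while the weighted vote returns class 1 because the close neighbor dominates the sum.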

4. Methodology

4.1. Datasets description

The ML programs used in this work require a dataset of known examples to predict new examples. This dataset can be formatted as a table, where each line represents a different soil example. Input features are represented as columns and the last column contains the output feature. Table 3 presents a sample with 10 soil examples (lines), within a 0.45 m soil layer. In this sample, the inputs are raw CPT data and the output is the corresponding IST soil class, obtained using the CPeT-IT software. This program is also used to produce other input-output combinations for the ML techniques, as described in more detail in Section 4.3.

Thirty-eight of the CPT soundings used to compose the datasets were sent directly by Professor P.K. Robertson. They are the same ones used in Robertson (2016) to produce the FSB method, which is described in Section 2.2. Since detailed information about these soundings can be found in the original reference, only a brief description of them is presented in Table 4.

The first column of Table 4 gives a general description of the soil types within the CPT soundings. The second identifies where soundings were taken and the third gives the geological age when the soil was deposited. The last column presents a discrete ordered variable named “class of geology” (CG), where the most recent age is assigned 1 and the others are numbered sequentially up to the oldest. The information from these 38 soundings plus the variable CG compose what is here named the geological dataset. The objective of including CG as an input feature in some of the studies presented in Section 5 is to investigate whether information about geological age can help differentiate one soil class from another.

Another 73 CPT soundings were obtained from the website of Professor P.W. Mayne and are summarized in Table 5. Further detail about the soundings can be found on the website. All these soundings were taken within the United States of America and more specific information about location is presented in the table. Information about geological age was not available for these soundings, so they are not included in studies that make use of the variable CG. These soundings, grouped with the ones sent by Robertson, compose what is here named the complete dataset, totaling 111 CPT soundings. All CPT data used in this dataset were taken at intervals of 2 to 5 cm, the pore pressure was measured behind the cone tip (u2) and the raw cone tip resistance qc was corrected to qt using CPeT-IT.

4.2. Data preprocessing

In this work, all CPT data were used to classify soil using the CPeT-IT software, which was later replicated using the methods described in Section 3. The accuracy of the final results depends on the quality of the datasets used, which can be improved with data preprocessing.

The first problem is that distance-based ML techniques are sensitive to data scale. When the distance between points is calculated, the importance of input features that vary within large ranges tends to be emphasized, while the ones with low variation tend to be ignored. The solution adopted here is normalizing all input features to the interval [0, 1].
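A common way to map each feature to [0, 1] is min-max scaling, sketched below with the first three rows of Table 3. Whether the minimum and maximum are taken over the whole dataset or over the training partition only is not stated in the text, and mapping a constant column to 0 is an assumption of this sketch.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` linearly to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0 (assumed)
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# z (m), qc (MPa), fs (kPa), u2 (kPa): first three rows of Table 3
sample = [
    (13.00, 16.93, 55.84, 120.32),
    (13.05, 16.53, 46.32, 124.47),
    (13.10, 10.14, 36.69, 129.95),
]
scaled = min_max_scale(sample)
```

After scaling, a 0.1 m change in depth and a 19 kPa change in fs both span the full unit interval in this sample, so neither feature dominates the distance of Eq. 8.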

Another issue is that data taken within CPT soundings can contain noise, which is here defined as any variable becoming severely different from what it was supposed to be. Noise can have several causes, like sensor errors, formatting problems and human mistakes. The main noise types are missing data and outliers, which are here defined as distorted or corrupted values. CPeT-IT is unable to classify most noisy examples, assigning class 0 in both IST and FSB methods or no class whatsoever. Since the ML



Table 3 - Sample of soil examples.

Inputs                                        Output
z (m)    qc (MPa)    fs (kPa)    u2 (kPa)     IST class
13.00    16.93       55.84       120.32       6
13.05    16.53       46.32       124.47       6
13.10    10.14       36.69       129.95       6
13.15     7.13       24.76       146.66       6
13.20     4.92       22.80       158.69       6
13.25     3.90       21.47       163.42       5
13.30     3.28       21.46       159.58       5
13.35     2.73       23.89       153.03       5
13.40     3.70       33.80       148.37       5

techniques presented in Section 3 are used here to replicate CPeT-IT, these errors tend to be replicated as well.

Although it is difficult to completely eliminate noise from the datasets, it is desirable to reduce it as much as possible in order to avoid classification errors. In this work, dataset cleaning was first performed manually, removing the noisy examples that could be easily identified. This procedure was then complemented by an automatic cleaning procedure that makes use of the box-plots of the input features, as illustrated in Fig. 5.

In the box-plot, the base of the rectangle represents the first quartile Q1 and the top of the rectangle represents the third quartile Q3. The whiskers above and below the rectangle represent the interval [Q1 - 1.5 × IQ, Q3 + 1.5 × IQ], where IQ = Q3 - Q1. Values outside this range (white circles) are identified as potential outliers. Preliminary tests have shown that removing all potential outliers affects accuracy, which indicates that relevant information is being eliminated. To solve this problem, the Edit Nearest Neighbor technique (Wilson, 1972) is used in this work as a second criterion to decide if each potential outlier will, in fact, be removed. This technique compares the potential outlier with its nearest neighbor and removes it only if the classes given by CPeT-IT do not match.

This procedure is illustrated in Fig. 6 for two input features, where the white dot represents the potential outlier and the black dots represent other known examples from the dataset. The numbers close to each dot represent the class assigned by CPeT-IT. One can observe that, in this example, the classes of the potential outlier and its nearest



Table 4 - Geological dataset (Robertson, 2016). CG: class of geology, i.e., classification in terms of geological age.

General soil type    Identification                 Geological age      CG
Mixed soils          UBC, Canada                    Holocene            2
Mixed soils          Venice Lagoon, Italy           Holocene            2
Mixed soils          Ford Center, USA               Pleistocene         4
Mixed soils          San Francisco, USA             Late Pleistocene    3
Mixed soils          Tailings, USA                  Recent              1
Mixed soils          UBC KIDD, Canada               Holocene            2
Mixed soils          UBC KIDD, Canada (2)           Holocene            2
Soft clay            Bothkennar, UK                 Holocene            2
Soft clay            Burswood, Perth, Australia     Holocene            2
Soft clay            Onsoy, Norway                  Holocene            2
Soft clay            Amherst, USA                   Late Pleistocene    3
Soft clay            San Francisco Bay, USA         Holocene            2
Soft clay            San Francisco Bay, USA (2)     Holocene            2
Soft rock            Newport Beach, USA             Miocene             5
Soft rock            LA Downtown, USA               Miocene             5
Soft rock            Newport Beach, USA (2)         Miocene             5
Stiff clay           Madingley, UK                  Cretaceous          6
Stiff clay           Houston, USA                   Pleistocene         4

Table 5 - Number of CPTs and test location from the P.W. Mayne database (acquired in years 2000 – 2003).

Location                   Number of soundings
Gosnell, Arkansas          1
Lenox, Tennessee           1
Memphis, Tennessee         16
Dexter, Missouri           6
Mooring, Tennessee         6
Marked Tree, Arkansas      19
Collierville, Tennessee    1
Meramec, Missouri          4
Opelika, Alabama           4
Wilson, Arkansas           4
Wolf, Wyoming              7
Wyatt, Missouri            4
Total                      73

neighbor are the same. This means that this potential outlier will be maintained.
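The two-stage cleaning rule (box-plot flag, then nearest-neighbor check) can be sketched as below. This is an illustration under stated assumptions: the quartile convention (`method="inclusive"`), the use of unscaled features for the neighbor search and the function names are choices made here, not details given in the text.

```python
import math
import statistics


def iqr_bounds(values):
    """Return the whisker interval [Q1 - 1.5 IQ, Q3 + 1.5 IQ] of one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    return q1 - 1.5 * iq, q3 + 1.5 * iq


def indices_to_keep(rows, labels):
    """Drop an example only if it is a potential outlier in some feature
    and its nearest neighbor carries a different class."""
    bounds = [iqr_bounds(col) for col in zip(*rows)]
    keep = []
    for i, row in enumerate(rows):
        flagged = any(v < lo or v > hi for v, (lo, hi) in zip(row, bounds))
        if flagged:
            nearest = min((j for j in range(len(rows)) if j != i),
                          key=lambda j: math.dist(rows[j], row))
            if labels[nearest] != labels[i]:
                continue               # classes disagree: remove the example
        keep.append(i)
    return keep
```

In the situation of Fig. 6 the flagged example and its nearest neighbor share the same class, so the index of the flagged example stays in the returned list.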

The next issue to be evaluated is whether the number of examples within each soil class is balanced, considering both IST and FSB methods. Severe imbalance can compromise the accuracy of distance-based ML techniques because they tend to focus on majority classes and ignore minority classes. The distribution of examples among classes can be checked using histograms, as presented in Figs. 7a and 7b for the complete dataset and Figs. 7c and 7d for the geological dataset.

One can observe that the classes are, in fact, imbalanced, which is expected for real CPT soundings. In this work, data imbalance is reduced by eliminating examples of majority classes and creating new artificial examples for minority classes. Preliminary results have shown that random elimination does not affect predictive performance, which can be explained by the fact that CPT data contain redundancies due to several data items being taken within each soil layer.

To create new artificial examples for minority classes, the SMOTE (Chawla et al., 2002) technique was used. For better distribution within the input feature space, it is here proposed to estimate each d-dimensional new artificial object from d + 1 original examples. This corresponds to the number of vertices of a d-dimensional simplex. The final number of elements of each class in the balanced dataset was set to the maximum between 1000 and two times the number of elements of the minority class. Since class 0 of the IST method could not be well represented within the geological dataset even with the use of SMOTE, examples of this class were completely removed from the geological dataset.
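The text does not detail how the d + 1 original examples are selected or combined. The sketch below shows one plausible reading, in which a new point is drawn inside the simplex spanned by d + 1 examples of the same class; the random choice of vertices and the uniform barycentric weights are assumptions, and both function names are hypothetical.

```python
import random


def target_class_size(class_counts):
    """Final size of every class: max(1000, 2 x size of the smallest class)."""
    return max(1000, 2 * min(class_counts.values()))


def simplex_sample(examples, rng):
    """Create one artificial d-dimensional point from d + 1 original examples.

    Assumption: vertices are picked at random within the class and combined
    with barycentric weights drawn uniformly over the simplex.
    """
    d = len(examples[0])
    vertices = rng.sample(examples, d + 1)         # needs at least d + 1 examples
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]             # non-negative, sum to one
    return [sum(w * v[l] for w, v in zip(weights, vertices)) for l in range(d)]
```

Because the weights are non-negative and sum to one, every artificial point lies inside the convex hull of the chosen examples, so it cannot leave the range already covered by the class.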

4.3. General strategy

Two ML algorithms are tested and compared, the classical KNN and the Gaussian DWNN, with respect to their capacity for replicating the IST and FSB soil classification systems. This comparison is made using several input feature combinations, including three basic sets:

• First set: depth z (m), corrected cone resistance qt (MPa), lateral friction fs (kPa) and pore pressure behind the cone tip u2 (kPa);

• Second set: depth z (m), normalized cone resistance Qt1, normalized lateral friction Fr (%) and normalized pore pressure Bq;

• Third set: depth z (m), normalized cone resistance Qtn, normalized lateral friction Fr (%) and normalized pore pressure U2.

The first set contains only non-normalized parameters, the second contains inputs of the IST method combined with depth and the third contains inputs of the FSB method combined with depth. For the main analysis, all combinations of two, three and four input features within each set were tested, although not all of them are presented in Section 5. Additional selected input feature combinations are tested in three complementary studies.
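The number of combinations in the main analysis can be checked by enumeration. The sketch below uses short labels for the three sets (the dictionary keys are names chosen here) and counts the combinations set by set.

```python
from itertools import combinations

# The three basic sets of Section 4.3; the keys are labels chosen for this sketch
FEATURE_SETS = {
    "raw": ("z", "qt", "fs", "u2"),
    "IST": ("z", "Qt1", "Fr", "Bq"),
    "FSB": ("z", "Qtn", "Fr", "U2"),
}


def feature_combinations(features, sizes=(2, 3, 4)):
    """All subsets of `features` with two, three or four elements."""
    return [combo for r in sizes for combo in combinations(features, r)]


per_set = {name: feature_combinations(feats)
           for name, feats in FEATURE_SETS.items()}
total = sum(len(combos) for combos in per_set.values())
```

Each set of four features yields 6 + 4 + 1 = 11 combinations, so the three sets give 33 when counted per set, which matches the 33 input feature combinations mentioned in the text. The pair (z, Fr) belongs to both the second and the third set, so this match assumes that combinations are counted set by set.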

In order to generate statistically relevant comparisons, a 10-fold cross-validation procedure (Stone, 1974) was applied to evaluate classification accuracy. The procedure starts by randomly separating the dataset into ten partitions or folds with approximately the same size, maintaining the same proportion between classes observed in the complete dataset. At each cross-validation round, one partition is left for testing, another partition (chosen at random) is used as a validation set and the remaining partitions compose the training set. The validation set is used to calibrate the best number of neighbors k to be used in the distance-based algorithms.
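The partitioning described above can be sketched as follows. The round-robin dealing used to keep class proportions, the seeds and the function names are assumptions of this illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into folds that keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)       # deal each class round-robin
    return folds


def cv_rounds(labels, n_folds=10, seed=0):
    """Yield (train, validation, test) index lists, one triple per round."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, n_folds, seed)
    for test_fold in range(n_folds):
        val_fold = rng.choice([f for f in range(n_folds) if f != test_fold])
        train = [i for f in range(n_folds)
                 if f not in (test_fold, val_fold) for i in folds[f]]
        yield train, folds[val_fold], folds[test_fold]
```

In each round the validation indices would be used to choose k, and the test indices only to score the final prediction.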

Figure 5 - Box-plot example using generic numbers. The rectangle represents ordinate values within the 1st and 3rd quartiles and the circles represent outliers.

Figure 6 - Edit nearest neighbor technique, with two possible classes (1 and 2) and black dots representing known examples. The unknown example (white) is labeled with the class of its nearest neighbor.

For each cross-validation round, the average of the accuracies per class is taken. This avoids disregarding minority classes in the performance evaluation. After all folds are used for testing, the mean and the standard deviation of the accuracy are computed. To compare the results of the experiments, the Friedman statistical test (Friedman, 1937) with the Nemenyi post-hoc statistics (Nemenyi, 1963) and the Wilcoxon statistical test (Wilcoxon, 1945) are used, based on the 10 accuracies recorded (one per test fold).
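The average of the accuracies per class can be computed as in the short sketch below; the function name is chosen here for illustration.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average, over the classes present in y_true, of the per-class accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

With eight examples of class 1 all predicted correctly and two examples of class 2 of which one is wrong, the plain accuracy is 0.9 while the per-class average is (1.0 + 0.5) / 2 = 0.75, which makes the error on the minority class visible.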

5. Results and Discussion

A total of 132 classification results were generated to produce the comparisons presented in this main analysis: 2 replicated classification methods (IST and FSB, described in Section 2) × 33 input feature combinations × 2 distance-based classification algorithms. The units used for the input features are z (m), qt (MPa), fs (kPa), u2 (kPa) and Fr (%); the other ones are dimensionless. Each predicted soil class is compared to the one originally given by CPeT-IT to compute accuracy. Tables present the mean and standard deviation of the accuracy obtained within the 10-fold cross-validation procedure described in Section 4.3.

Combinations that presented the best performance with KNN for replicating IST outputs are presented in Table 6. Since the first combination uses the original IST inputs and outputs, it was expected that it would lead to the highest mean accuracy among all. Nonetheless, results of the Friedman statistical test with the Nemenyi post-hoc statistics show a statistical equivalence between the first two combinations in Table 6. Thus, the last two combinations shown in Table 6 can be considered of lower performance. This



Table 6 - Best KNN predictive results for replicating IST.

Inputs          Elected k    Mean (%)    SD (%)
Qtn Fr          1            96.52       0.57
Qtn Fr U2       3            94.70       0.96
Qtn z Fr        1            93.49       0.67
Qtn z Fr U2     1            92.54       1.12

Figure 7 - Histograms. (a) For IST classes and the complete dataset. (b) For FSB classes and the complete dataset. (c) For IST classes and the geological dataset. (d) For FSB classes and the geological dataset.

shows that adding more features to the original ones does not improve performance in this case.

The same comparison is proposed for the combinations that lead to the best performance with the Gaussian DWNN technique for replicating IST outputs, which are presented in Table 7. One can observe that results are very close to those presented in Table 6, reinforcing the same conclusions.

Considering now the classical KNN technique for replicating FSB outputs, the best feature combinations are presented in Table 8. In this case, the Friedman statistical test with the Nemenyi post-hoc statistics shows that the last two feature combinations are equivalent and statistically better than the first two. One can observe that, as expected, the best combinations for this case include all three original FSB inputs, namely Qtn, Fr and U2. However, adding depth to these features improved performance, even with the bias due to the way in which the outputs were generated.

Finally, the feature combinations that produced the best performance for the Gaussian DWNN technique for replicating FSB outputs are presented in Table 9. One can observe that the results are very close to the ones from Table 8, reinforcing that using the original FSB inputs leads to good accuracy and that including depth among these features improves performance.

Concerning more general observations, both tested ML techniques presented good performance for replicating both soil classification systems. With respect to the non-normalized inputs, good performance can be observed when they are associated with depth. For IST and both ML techniques, for example, accuracy is around 70% when only qt and fs are used as input features, but rises to close to 90% when z is included. These observations suggest that proposing a soil classification system that uses only raw CPT data would be feasible if depth is included. Nevertheless, one should notice that confirming this hypothesis would require further studies.

Another general observation concerns evaluating which classification technique is better, comparing the classical KNN and the Gaussian DWNN. The Wilcoxon test was employed for this task, considering differences significant for a p-value of at most 5%. Comparing all combinations, results show that the Gaussian DWNN presents better predictive performance than the classical KNN.
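The statistic behind the Wilcoxon signed-rank test can be sketched as below. For illustration only, the paired values are the eight mean accuracies reported in Tables 6 to 9 (four IST and four FSB combinations), whereas the comparison in the text uses all tested combinations, so this does not reproduce the reported test. Discarding zero differences and averaging tied ranks are the usual conventions and are assumed here; computing the p-value would additionally require the exact distribution or a statistical library.

```python
def signed_rank_sums(a, b):
    """Return (W+, W-): rank sums of the positive and negative differences a - b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1                # tied values share a rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Mean accuracies (%) from Tables 6 and 8 (KNN) and Tables 7 and 9 (DWNN)
knn = [96.52, 94.70, 93.49, 92.54, 88.79, 91.86, 92.97, 93.83]
dwnn = [96.52, 94.63, 93.49, 92.63, 88.90, 91.86, 93.02, 93.83]
w_plus, w_minus = signed_rank_sums(dwnn, knn)
```

A rank sum W+ larger than W- indicates that DWNN is ahead more often and by larger margins in these eight pairs; with only four non-zero differences no significance can be claimed from this subset.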

6. Conclusions and Recommendations

In this work, distance-based ML techniques are used to replicate systems for soil classification from CPT data. It is important to notice that the proposed discussions and the conclusions obtained would not be possible by using the original soil classification systems alone, because these original methods do not allow changing input features. It was the flexibility of the ML techniques that made it possible to evaluate, for example, whether raw inputs without normalizations have enough information for reproducing the original methods accurately.

The main advantages of the proposed approach are the ease of applying it to different datasets and the little adaptation required for it to be associated with other ML techniques. The use of distance-based techniques can also be considered advantageous for its simplicity, since accurate results were obtained. Thus, the presented method can be considered rigorous compared to other works from the literature that make use of ML applications in geoscience, which do not present a data analysis as detailed as in Section 4.

A total of 132 tests were performed to draw the discussions and conclusions presented and in all of them the mean accuracy is above 85%, which can be considered reasonable within geotechnical applications. Notice the good results obtained using raw parameters, which suggest that it would make sense to dispense with some types of data normalization that are proposed in the literature for soil classification systems. Reducing data normalization is advantageous because any data transformation applied to the original dataset tends to diminish its original information, especially if the original number of input features is reduced. Results presented here are not sufficient to affirm that using raw parameters would lead to greater performance, nonetheless



Table 9 - Best DWNN predictive results for replicating FSB.


Qtn Fr 7 88.90 0.40

Qtn z Fr 1 91.86 0.28

Qtn Fr U2 1 93.02 0.38

Qtn z Fr U2 1 93.83 0.55

Table 8 - Best KNN predictive results for replicating FSB.


Qtn Fr 7 88.79 0.40

Qtn z Fr 1 91.86 0.28

Qtn Fr U2 3 92.97 0.46

Qtn z Fr U2 1 93.83 0.55

Table 7 - Best DWNN predictive results for replicating IST.


Qtn Fr 1 96.52 0.57

Qtn Fr U2 1 94.63 0.98

Qtn z Fr 1 93.49 0.67

Qtn z Fr U2 1 92.63 1.02

they can justify future studies about this issue. Other con-clusions to be pointed out are:• Highest accuracies were obtained when using the origi-

nal IST inputs and outputs;• Including depth as an input increased accuracy, in most

cases;• Gaussian DWNN is better than the classical KNN, con-

sidering the Wilcoxon test with a p-value of at most 5%.Future studies that can be conducted include applying

and comparing different ML techniques to this same prob-lem, discussing other geotechnical issues about soil classi-fication systems that can not be exposed using distance-based techniques. Another possible investigation is apply-ing clustering techniques to the problem, taking advantageof the ease of increasing dimensionality to test several nor-malized and non-normalized feature combinations. Thus,CPT data can be associated with data from other in situ ex-periments like the standard penetration test or the flat dila-tometer test, exploring the problem with even higherdimensionality.

Acknowledgments

To Peter K. Robertson and Paul W. Mayne for making available the dataset used in this work. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Computer Code Availability

The code produced to generate all results presented in this paper can be downloaded from the following link: https://github.com/Orbolato/KNN.git.

ReferencesArel, E. (2012). Predicting the spatial distribution of soil

profile in Adapazari/Turkey by artificial neural net-works using CPT data. Computers and Geosciences,43:90-100.

Begemann, H.K.S. (1965). The friction jacket cone as anaid in determining the soil profile. Proc. 6th Int. Conf.on Soil Mech. and Found. Engn., ICSMFE, Montreal,v.1, pp. 8-15.

Bhattacharya, B. & Solomatine, D.P. (2006). Machine lear-ning in soil classification. Neural Networks,19(2):186-195.

Bilski, P. & Rabarijoely, S. (2009). Automated soil catego-rization using CPT and DMT investigations. Proc. 2nd

Int. Conf. on New Developments in Soil Mechanics andGeotechnical Engineering, Nicosia, North Cyprus, v. 1,pp. 1-8.

Chandan, T.R. (2018). Recent trends of machine learningin soil classification: A review. International Journal ofComputational Engineering Research (IJCER),8(9):25-32.

Chawla, N.V.; Bowyer, K.W.; Hall, L.O. & Kegelmeyer,W.P. (2002). SMOTE: Synthetic minority over-sam-

pling technique. Journal of Artificial IntelligenceResearch, 16:321-357.

Cover, T.M. & Hart, P.E. (1967). Nearest neighbor patternclassification. IEEE Transactions on Information The-ory, 13(1):21-27.

Das, S.K. & Basudhar, P.K. (2009). Utilization of self-organizing map and fuzzy clustering for site character-ization using piezocone data. Computers andGeotechnics, 36(1-2):241-248.

Douglas, B.J. (1981). Soil classification using electric conepenetrometer. In Symp. on Cone Penetration Testingand Experience, Geotech. Engrg. Div., American Soci-ety of Civil Engineers, New York, NY, pp. 209-227.

Dudani, S.A. (1976). The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man,and Cybernetics, SMC-6(4):325-327.

Facciorusso, J. & Uzielli, M. (2004). Stratigraphic profilingby cluster analysis and fuzzy soil classification frommechanical cone penetration tests. Proc. ISC-2 on Geo-technical and Geophysical Site Characterization. Rot-terdam, v. 1, pp. 905-912.

Friedman, M. (1937). The use of ranks to avoid the assump-tion of normality implicit in the analysis of variance.Journal of the American Statistical Association,32(200):675-701.

Goh, A.T. (1995). Modeling soil correlations using neuralnetworks. Journal of Computing in Civil Engineering,9(4):275-278.

Goh, A.T. (1996). Neural-network modeling of CPT seis-mic liquefaction data. Journal of Geotechnical Engi-neering, 122(1):70-73.

Goh, A.T. & Goh, S.H. (2007). Support vector machines:Their use in geotechnical engineering as illustrated us-ing seismic liquefaction data. Computers and Geotech-nics, 34(5):410-421.

Hanna, A.M.; Ural, D. & Saygili, G. (2007). Neural net-work model for liquefaction potential in soil depositsusing Turkey and Taiwan earthquake data. Soil Dynam-ics and Earthquake Engineering, 27(6):521-540.

Hechenbichler, K. & Schliep, K. (2004). Weighted k-near-est-neighbor techniques and ordinal classification. Col-laborative Research Center 386, Discussion Paper 399.

Hegazy, Y.A. & Mayne, P.W. (2002). Objective site char-acterization using clustering of piezocone data. Journalof Geotechnical and Geoenvironmental Engineering,128(12):986-996.

Ioannides, J. & Robertson, P.K. (2016). CPeT-IT v.2.0 –CPT interpretation software. URL:https://geologismiki.gr/, accessed on July 26th 2019.

Jefferies, M.G. & Davies, M.P. (1991). Soil classificationby the cone penetration test: Discussion. Canadian Geo-technical Journal, 28(1):173-176.

Juang, C.H. & Chen, C.J. (1999). CPT-based liquefactionevaluation using artificial neural networks. Computer-


Carvalho & Ribeiro

Aided Civil and Infrastructure Engineering, 14(3):221-229.

Juang, C.H.; Yuan, H.; Lee, D.H. & Lin, P.S. (2003). Sim-plified cone penetration test-based method for evaluat-ing liquefaction resistance of soils. Journal of Geotech-nical and Geoenvironmental Engineering, 129(1):66-80.

Kohestani, V.R.; Hassanlourad, M. & Ardakani, A. (2015).Evaluation of liquefaction potential based on CPT datausing random forest. Natural Hazards, 79(2):1079-1089.

Kumar, J.K.; Konno, M. & Yasuda, N. (2000). Subsurfacesoil-geology interpolation using fuzzy neural network.Journal of Geotechnical and Geoenvironmental Engi-neering, 126(7):632-639.

Kurup, P.U. & Griffin, E.P. (2006). Prediction of soil com-position from CPT data using general regression neuralnetwork. Journal of Computing in Civil Engineering,20(4):281-289.

Liao, T. & Mayne, P.W. (2007). Stratigraphic delineationby three-dimensional clustering of piezocone data.Georisk, 1(2):102-119.

Livingston, G.; Piantedosi, M.; Kurup, P. & Sitharam, T.G.(2008). Using decision-tree learning to assess liquefac-tion potential from CPT and Vs. Proc. of the 4th Geo-technical Earthquake Engineering and Soil DynamicsCongress, Sacramento, California, v. 1, pp. 1-10.

Lunne, T.; Robertson, P.K. & Powell, J.J. (2002). ConePenetration Testing in Geotechnical Practice. Taylor &Francis Group, London and New York.

Mayne, P.W.; Peuchen, J. & Bouwmeester, D. (2010). SoilUnit Weight Estimated from CPTu in Offshore Soils.Gourvenec & White (eds.) Front Offshore Geotech,Taylor & Francis Group, London, pp. 371-376.

Mayne, P.W. (2014). Interpretation of geotechnical param-eters from seismic piezocone tests. Proc. of 3rd Interna-tional Symposium on Cone Penetration Testing,CPT14, Las Vegas, Nevada, v. 1, pp. 1-27.

Nemenyi, P.B. (1963). Distribution-Free Multiple Compar-isons. Ph.D. Thesis, Princeton University.

Ramsey, N. (2002). A calibrated model for the interpreta-tion of cone penetration tests (CPTs) in North Sea qua-ternary soils. Proc. of Offshore Site Investigation andGeotechnics Diversity and Sustainability, Society ofUnderwater Technology, London, v. 1, pp. 1-16.

Rao, A.; Janhavi, U.; Abhishek, G.N.; Manjunatha & Be-ham, R.A. (2016). Machine learning in soil classifica-tion and crop Detection. International Journal forScientific Research & Development, 4(1):792-794.

Reale, C.; Gavin, K.; Libric, L. & Juric-Kacunic, D. (2018).Automatic classification of fine-grained soils usingCPT measurements and Artificial Neural Networks.Advanced Engineering Informatics, 36:207-215.

Robertson, P.K.; Campanella, R.G.; Gillespie, D. & Greig,J. (1986). Use of Piezometer Cone Data. Clemence, S.P.

(ed.) Use of In Situ Tests in Geotechnical Engineering.American Society of Civil Engineers, New York,pp. 1263-1280.

Robertson, P.K. (1990). Soil classification using the conepenetration test. Canadian Geotechnical Journal,27(1):151-158.

Robertson, P.K. (1991). Soil classification using the conepenetration test: Reply. Canadian Geotechnical Journal,28(1):176-178.

Robertson, P.K. (2009). Interpretation of cone penetrationtests - a unified approach. Canadian Geotechnical Jour-nal, 46(11):1337-1355.

Robertson, P.K. (2016). Cone penetration test (CPT)-basedsoil behaviour type (SBT) classification system - an up-date. Canadian Geotechnical Journal, 53(12):1910-1927.

Rogiers, B.; Mallants, D.; Batelaan, O.; Gedeon, M.; Huys-mans, M. & Dassargues, A. (2017). Model-based clas-sification of CPT data and automated lithostratigraphicmapping for high-resolution characterization of a heter-ogeneous sedimentary aquifer. Plos One,12(5):e0176656.

Schaap, M.G.; Leij, F.J. & Van Genuchten, M.T. (1998).Neural network analysis for hierarchical prediction ofsoil hydraulic properties. Soil Science Society of Amer-ica Journal, 62(4):847-855.

Schneider, J.A.; Randolph, M.F.; Mayne, P.W. & Ramsey,N.R. (2008). Analysis of factors influencing soil classi-fication using normalized piezocone tip resistance andpore pressure parameters. Journal of geotechnical andgeoenvironmental engineering, 134(11):1569-1586.

Schneider, J.A.; Hotstream, J.N.; Mayne, P.W. & Ran-dolph, M.F. (2012). Comparing CPTU Q - F and Q -�u2/�’v0 soil classification charts. Geotechnique Letters,2(4):209-215.

Stone, M. (1974). Cross-validatory choice and assessmentof statistical predictions. Journal of the Royal StatisticalSociety: Series B (Methodological), 36(2):111-133.

Wilcoxon, F. (1945). Individual comparisons by rankingmethods. Biometrics bulletin, 1(6):80-83.

Wilson, D.L. (1972). Asymptotic properties of nearest nei-ghbor rules using edited data. IEEE Transactions onSystems, Man, and Cybernetics, SMC-2(3):408-421.

Zhang, G.; Robertson, P.K. & Brachman, R.W. (2002). Es-timating liquefaction-induced ground settlements fromCPT for level ground. Canadian Geotechnical Journal,39(5):1168-1180.

List of Symbols

Bq: normalized excess pore pressure
CG: class of geology
d: feature space dimensionality
dist: distance between points
Fr: normalized friction ratio
fs: lateral friction
Ic: classification index
IQ: interquartile range
k: number of nearest neighbors
n: exponent of σv0'
pa: reference pressure
Q1: first quartile
Q3: third quartile
qc: cone resistance
qt: total cone resistance
Qt1: normalized cone resistance
Qtn: updated normalized cone resistance
Rf: friction ratio
SD: standard deviation
St: sensitivity
u0: equilibrium pore pressure
u2: pore pressure measured behind the cone tip
U2: updated normalized excess pore pressure
w: Gaussian weighting
xi, xj: points representing objects
z: depth
γ: soil unit weight
σv0: total overburden pressure
σv0': effective overburden pressure


