Priyank Srivastava (PE 5370: Mid- Term Project Report)
Contents
Executive Summary
PART-1: Identify Electrofacies from Given Logs using data mining algorithms
    Selection of wells
    Data cleaning and Preparation of data for input to data mining
    Selection of data mining technique & Workflow
    Mathematical Background of PCA and K-Means clustering
    Interpretation of Results
        Relationship of Predicted Electrofacies with original variables
    “The folly of trusting Data mining”
PART-2: Doing Clustering using SOM and “R” package
    Clustering and SOM in ‘R’
PART-3: Clustering using Merged Dataset of all wells
Conclusion
Appendix – A: R Code for Part-III
Executive Summary
The objective of this project is to prepare a data mining model that estimates electrofacies from a set of
open-hole well logs. The trained model can then be used as a predictive tool for estimating unknown logs
at any new location. The workflow uses principal component analysis (PCA) and the K-means clustering
algorithm to build the model.
This report is divided into three parts. In Part I, the data mining algorithm is run on individual wells,
using different attributes for each well depending on availability. The resulting clusters are mapped
back to the individual wells based on gamma ray values, which broadly show Facies 1 as high gamma ray,
Facies 3 as a mixed sand-shale sequence, and Facies 2 as low gamma ray. The presence of these facies is then
correlated with the production rates of the wells to assess the reservoir quality of each facies.
Although K-means always converges, the answer it gives depends on the initial centers, and the centers it
returns are averages of data points. Some of the wells (Young Joe; Flanik Randal) that do not have a
complete dataset do not show any distinct clusters, so it is difficult to generalize the interpretation
from this model. Part I ends with a discussion of the disadvantages of K-means clustering. The unknown
logs in these wells could be predicted with the present model, but that is outside the scope of this project.
Data mining helps uncover hidden patterns in a dataset by exposing the relationships between attributes,
but it also uncovers many patterns that are not useful. It is up to the domain expert to filter the
patterns and accept those that validly answer the objective question. Thus, in Part II some of the wells
are clustered using self-organizing maps (SOM). In Part III, five attributes (GR, AT90, PEF, RHOB and NPHI)
are merged for all 10 selected wells and a similar workflow (PCA + K-means) is run to generate a generalized
model with three clusters, from which the facies and their characteristics are identified.
To conclude, the findings of the Part III study are summarized in the following table:
Cluster name   Interpretation
1              Shales/sands with low porosity (0.09) and low resistivity (9.12). Probably tight shales with high clay-bound water (since NPHI is high, 0.289).
3              Shales/sands with very low porosity (0.038) but higher resistivity (16.26) and grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
2              Probably the hottest spot in this region, with good porosity and high hydrocarbon saturation. The well with the largest amount of Facies 2 should therefore be the most prolific producer.
PART-1: Identify Electrofacies from Given Logs using data mining algorithms
Selection of wells
I chose the wells according to their API numbers: 10 wells in Parker County (API prefix 42-367) were selected.
Not all wells have the same amount of data; some have processed logs and others do not. The
table below lists the API number, well name, and production rate of each chosen well.
API            Well name          Production rate* (Mscf/day)
42-367-34050   Moore              --
42-367-34447   Deaton             202
42-367-34576   Frank-Mask         830
42-367-34094   Sugar Tree         532
42-367-34227   Westhoff John      1029
42-367-34343   Flamik Randal      201
42-367-34385   Young Joe          779
42-367-34438   Kinyon             493
42-367-34744   Hagler             1365
42-367-34883   Lake Weatherford   965
*From Drillinginfo.com
Based on the production rate, the wells can be divided into three categories. The goals of this project are
(1) to classify each well into electrofacies and (2) to determine whether well performance can be related
to the newly classified electrofacies.
Data cleaning and Preparation of data for input to data mining
The logs given to us are processed and contain many redundant and missing parameters, so it is
imperative to select and clean the data before choosing the attributes used as input to the data
mining algorithms. We want to develop electrofacies for the upper Barnett and lower Barnett zones. The
local stratigraphy of the subsurface is given in Figure 1; as observed, the Barnett Shale is divided into
two parts by the Forestburg limestone. Thus, before feeding data to any data mining algorithm, we need to
remove these limestone zones. Since the mud resistivity in all of the given logs is on the order of
0.4 ohm-m, we can infer that all the wells were drilled with water-based mud, and hence we can use the
photoelectric (PE) log as a lithology indicator, since carbonates usually have high PE values of about 5.
We therefore keep only the depths where PE < 4, which screens out the carbonate intervals. Additional
filtering is done by screening out all depths where density (RHOB) > 2.7 g/cc. Figure 2 shows the workflow
used for cleaning and filtering the depths so that the final output contains depths and parameters of only
the upper and lower Barnett Shale.
Figure 3 lists the attributes selected for each well. The Flamik Randal and Young Joe wells contain the
fewest attributes.
Figure 1: General stratigraphy of the Ordovician to Pennsylvanian section in the Fort Worth Basin (Loucks & Ruppel, 2007)
Figure 2: Workflow for data cleaning
1. Select all depths with PEF < 4.
2. Select all depths with non-zero GR, RHOB, and AT90, and with 0 < NPHI < 1.
3. Normalize every parameter to zero mean and unit variance.
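The cleaning steps of Figure 2, together with the RHOB cutoff mentioned in the text, could be sketched in R roughly as follows. This is only an illustrative sketch: the function name `clean_logs`, its arguments, and the small demo table are assumptions, not part of the original workflow; only the curve names and cutoffs come from the report.

```r
# Hypothetical helper: keep Barnett shale depths and standardize the curves.
# Curve names and cutoffs follow the report; everything else is assumed.
clean_logs <- function(logs, pef_max = 4, rhob_max = 2.7) {
  curves <- c("GR", "PEF", "AT90", "NPHI", "RHOB")
  keep <- complete.cases(logs[, curves]) &
    logs$PEF < pef_max &                      # drop carbonate-like readings
    logs$RHOB <= rhob_max &                   # drop dense (limestone) intervals
    logs$GR != 0 & logs$RHOB != 0 & logs$AT90 != 0 &
    logs$NPHI > 0 & logs$NPHI < 1
  filtered <- logs[keep, , drop = FALSE]
  # scale() subtracts each column mean and divides by its standard deviation
  scaled <- as.data.frame(scale(filtered[, curves]))
  list(filtered = filtered, scaled = scaled)
}

# Made-up values for illustration only (not taken from any of the wells)
demo <- data.frame(
  DEPT = c(5000, 5000.5, 5001, 5001.5, 5002),
  GR   = c(150, 40, 180, 0, 130),
  PEF  = c(3.1, 5.0, 3.3, 3.0, 3.5),
  AT90 = c(20, 200, 15, 10, 30),
  NPHI = c(0.20, 0.05, 0.25, 0.22, 0.18),
  RHOB = c(2.50, 2.71, 2.55, 2.52, 2.60)
)
result <- clean_logs(demo)
print(result$filtered$DEPT)
```

In the demo, the second row fails the PEF and RHOB cutoffs and the fourth row has a zero GR reading, so only three depths remain before standardization.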
Figure 3: Summary of the meaningful curves that could be extracted from each well.
• Moore (9 attributes): GR (Max: 368; Min: 18), PEF (Max: 6.2; Min: 2.2), AT90 (Max: 862; Min: 0.68), NPHI (Max: 0.397; Min: 0.002), RHOB (Max: 2.76; Min: 2.34), WCLC (Ave: 0.183), WILL (Ave: 0.69), WQUA (Ave: 0.471), VCL (Ave: 0.332)
• Deaton (11 attributes): GR (Max: 337; Min: 12), PEF (Max: 5.18; Min: 1.8), AT90, NPHI (Max: 0.374; Min: 0), RHOB (Max: 2.825; Min: 2.39), WCLC (Ave: 0.176), WDOL (Ave: 0.096), WILL (Ave: 0.136), WQUA (Ave: 0.474), WTOC (Ave: 0.022), VCL (Ave: 0.237)
• Frank Mask (6 attributes): GR (Max: 346; Min: 0), NPHI (Max: 0.30; Min: 0), RHOB (Max: 2.705; Min: 0), VCL (Ave: 0.289), PR (Ave: 0.227), CB (0.205)
• Sugartree (10 attributes): GR (Max: 201; Min: 0), PEF (Min: 0; Max: 9.776), AT90 (Min: 0.224; Max: 173), NPHI (Min: -0.014; Max: 0.569), RHOB (Min: 2.75; Max: 0.30), WILL, WQUA, VCL, PR, BULKMOD
• Westhoff John (9 attributes): GR (Max: 368; Min: 18), PEF (Min: 2.28; Max: 6.234), NPHI (Min: 0.002; Max: 0.397), RHOB (Max: 2.76; Min: 2.34), WCAR (Ave: 0.025), WCLC (Ave: 0.183), WILL (Ave: 0.311), WQUA (Ave: 0.471), VCL (Ave: 0.332)
• Flamik Randal (5 attributes): GR (Min: 0; Max: 883), PEF (Min: 0; Max: 11.54), AT90 (Min: 0; Max: 927), NPHI (Min: 0; Max: 2.7), RHOB (Min: 0; Max: 164)
• Young Joe (5 attributes): GR (Min: 0; Max: 883), PEF (Min: 0; Max: 11.54), AT90, NPHI, RHOB
• Kinyon (7 attributes): GR, PEF, AT90, NPHI, RHOB, PR, YME
• Hagler (9 attributes): GR, PEF, AT90, NPHI, RHOB, WCLC, WILL, WQUA, VCL
• Lake Weatherford: GR, PEF, AT90, NPHI, RHOB, WILL, WQUA, WPYR
Selection of data mining technique & Workflow
Because of the high volume of log data, it is desirable to start with unsupervised data mining techniques
to find out whether the data contain any hidden trends or patterns. Many wells have as many as 200 log
attributes, so it is necessary to reduce the dimensionality of the data before applying any clustering
algorithm. I use principal component analysis (PCA) to reduce the data to three principal components and
then apply K-means clustering to generate clusters in the data. Figure 4 gives the PCA and clustering
density plots for the different wells in sequence. Clustering is done with the X-means algorithm, a
K-means variant that automatically optimizes the number of clusters by iteration. However, given the
uneven cluster sizes shown in Figure 4, it can be argued that this method is not giving the clusters we
want: in minimizing the within-cluster sum of squared errors, X-means gave more weight to the larger
clusters. This clustering technique is therefore not well suited to this case, since K-means implicitly
assumes that each cluster has roughly the same number of observations. Also, PCA is meant for correlated
attributes: it relies on the variance being concentrated along a few directions, so if the data show no
correlation, applying PCA is not meaningful.
Table 1: Parameters used in X-means clustering and PCA
PCA
    No. of components: selected so as to keep 90% of the variance
X-means clustering
    Min. clusters: 2
    Max. clusters: 60
    Numerical measure: Euclidean distance
    Max. runs: 10
    Max. optimization steps: 100
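As an illustration of the PCA setting in Table 1, the sketch below keeps the smallest number of principal components whose cumulative explained variance reaches 90%. It uses base R's `prcomp`; the function name `select_components` and the synthetic data are assumptions made for the example and do not reproduce the tool actually used for Part 1.

```r
# Sketch: choose the number of principal components by cumulative variance.
# The function name is hypothetical; threshold 0.90 follows Table 1.
select_components <- function(x, threshold = 0.90) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  explained <- pca$sdev^2 / sum(pca$sdev^2)       # variance share per component
  n_keep <- which(cumsum(explained) >= threshold)[1]
  list(pca = pca, n_keep = n_keep,
       scores = pca$x[, seq_len(n_keep), drop = FALSE])
}

set.seed(1)
# Synthetic stand-in for log curves: two strongly correlated columns, one independent
base <- rnorm(200)
x <- cbind(a = base + rnorm(200, sd = 0.1),
           b = base + rnorm(200, sd = 0.1),
           c = rnorm(200))
sel <- select_components(x)
print(sel$n_keep)
print(summary(sel$pca))
```

Because two of the three synthetic columns carry nearly the same information, two components are enough to pass the threshold, which is the situation where PCA is worthwhile.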
Figure 4: PCA density plots with X-means clustering for the following wells, in order from top left: 1. Moore 2. Deaton 3. Frank Mask 4. Sugar Tree 5. Westhoff John 6. Flanik Randal 7. Young Joe 8. Kinyon 9. Hagler 10. Lake Weatherford. With X-means clustering, most of the wells can be described by three clusters in the PCA data, but wells 6 and 7 do not display any specific clusters.
Mathematical Background of PCA and K-Means clustering
PCA is a dimensionality reduction technique for datasets with correlated attributes. The first principal
component is the direction of maximum variance in the data, and each subsequent principal component is
orthogonal to, and uncorrelated with, the others. Every attribute needs to be scaled before PCA is
applied. PCA is a very useful tool for exploratory data analysis and predictive modelling of
high-dimensional datasets. While PCA helps reveal internal patterns in the data, the next step in data
mining is clustering. Although the literature is rich in algorithms for efficient clustering, the
fundamental workflow is the one shown in Table 2.
Table 2: Workflow for clustering algorithms
Interpretation of Results
Since the principal components themselves have no physical meaning, the predicted clusters have to be
transformed back to the original data.
The table below gives the distribution of data points among the clusters for all the analyzed wells:
Well name          Data points after cleaning   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Moore              1884                         629         467         788         --
Deaton             2264                         1729        125         410         --
Frank Mask         3212                         2642        570         --          --
Sugar Tree         925                          581         56          288         --
Westhoff John      8016                         6539        1477        --          --
Flanik Randal      500                          240         115         124         21
Young Joe          80                           37          8           35          --
Kinyon             6462                         538         1680        4244        --
Hagler             2535                         1211        801         523         --
Lake Weatherford   5085                         1178        3121        786         --
Steps of the clustering workflow (Table 2):
1. Determine the number of clusters (centroids) to be placed.
2. Find the distance from each data point to each centroid and assign each data point to a centroid so as to minimize the sum of distances.
3. Find the centroid of each cluster formed in the first iteration and reclassify each data point to its cluster.
4. Recompute the centroids and reclassify the points so as to minimize the sum of distances from the centroids.
5. Iterate until the assignments converge and the number of clusters is optimized.
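The assign-and-recompute loop of this workflow can be sketched in R for a fixed number of clusters. The function `simple_kmeans` below is a hypothetical teaching version of the basic K-means (Lloyd) iteration, written only to make the steps concrete; it is not the X-means implementation used in Part 1, and in practice the built-in `kmeans()` would normally be preferred.

```r
# Sketch of the basic K-means iteration for a fixed k (names are hypothetical).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first") # index of nearest centroid
    if (all(new_cluster == cluster)) break             # assignments are stable
    cluster <- new_cluster
    for (j in seq_len(k)) {                            # recompute centroids
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
# Two well-separated synthetic groups of 50 points each (illustrative only)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
print(table(fit$cluster))
print(fit$centers)
```

The loop stops as soon as no point changes cluster, which corresponds to the convergence condition in the last step of the workflow.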
Relationship of Predicted Electro facies with original variables
Figure 5: The Moore well can be subdivided into three electrofacies using data mining, which can be correlated with gamma ray values. Facies 1 shows high gamma ray and is most probably a shale interval, while Facies 2 has lower radioactivity than Facies 1. Facies 3 has the lowest gamma ray reading.
[Plot: GR & Electrofacies for Moore Well. Horizontal axis: 0 to 400; vertical axis: Depth, 4400 to 5600. Curves: GR, ELECTROFACIES. Annotated intervals: Facies 1 dominated, Facies 3 dominated, Facies 2 dominated.]
Figure 6: The Deaton well seems to contain mainly Facies 1 and Facies 3, with very little Facies 2. In the Frank Mask well only two types of facies are present, but it is not easy to classify them based on the gamma ray log alone.
[Plot: GR & Electrofacies for Deaton Well. Horizontal axis: 0 to 400; vertical axis: Depth, 4900 to 6100. Curves: GR, ELECTROFACIES. Annotated intervals: Facies 1 dominated, Facies 3 dominated, Facies 1 dominated.]
[Plot: GR & Electrofacies, Frank Mask. Horizontal axis: 0 to 400; vertical axis: Depth, 5400 to 6800. Curves: GR, ELECTROFACIES. Annotated intervals: Facies 1 dominated, Facies 2 dominated, Facies 1 dominated.]
[Plot: GR & Electrofacies, Kinyon. Horizontal axis: 0 to 400; vertical axis: Depth, 5600 to 7000. Curve: GR. Labels: Facies 3, Facies 2, Facies 1.]
[Plot: GR & Electrofacies, Hagler. Horizontal axis: 0 to 400; vertical axis: Depth, 5600 to 7000. Curve: GR. Labels: Facies 3, Facies 2, Facies 1.]
“The folly of trusting Data mining”
Most data mining algorithms are heuristic processes that can be applied without any physical
understanding. Data mining is supposed to reveal hidden trends; however, applying any data mining task
blindly can lead to completely wrong outputs. Given below are some caveats of applying K-means clustering
to real-life datasets.
1. K-means assumes that the variance of the distribution of each attribute is spherical.
2. It does not work well on non-spherical datasets. Usually, the higher the dimensionality of the data, the more difficult it is to apply K-means efficiently.
3. The curse of unevenly sized clusters. K-means assumes that the prior probability of all K clusters is the same, i.e. each cluster has roughly the same number of observations, which is clearly not the case for our dataset.
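A related caveat, noted in the Executive Summary, is that the K-means answer depends on the initial centers. The sketch below illustrates this on synthetic data with three groups of very different sizes: several independent single-start runs of base R's `kmeans` are compared and the run with the smallest total within-cluster sum of squares is kept. The data and variable names are assumptions made for the example, not results from the wells.

```r
set.seed(7)
# Synthetic data: three groups of 300, 30 and 20 points (illustrative only)
x <- rbind(matrix(rnorm(600, mean = 0), ncol = 2),
           matrix(rnorm(60,  mean = 4), ncol = 2),
           matrix(rnorm(40,  mean = 8), ncol = 2))

# Ten independent single-start runs; each may stop in a different local optimum
single_runs <- lapply(1:10, function(i) kmeans(x, centers = 3, nstart = 1))
wss <- sapply(single_runs, function(f) f$tot.withinss)
best <- single_runs[[which.min(wss)]]

print(round(wss, 1))   # any spread here reflects the choice of starting centroids
print(best$size)       # cluster sizes of the best run
```

Passing a larger `nstart` to `kmeans()` performs the same best-of-several selection internally, which is the approach taken in the Appendix A code (`nstart=50`).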
PART-2: Doing Clustering using SOM and “R” package
Figure 7 shows the use of self-organizing map (SOM) U-matrix plots with K-means clustering for all the
wells, using the same attributes as in Part 1.
Figure 7: SOM clustering for Moore well
However, again it is difficult to evaluate the accuracy of clustering.
Clustering and SOM in ‘R’
‘R’ provides some flexibility and quality checks for clustering. The filtered data obtained from the
Part 1 data cleaning workflow, with the additional constraint GR > 120, are used as input to R, and
K-means clustering is applied to see how it performs. This is done for the following four wells: Moore,
Deaton, Frank Mask, and Kinyon. This section describes the results of using ‘R’.
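One common quality check of this kind is to compute the within-groups sum of squares for a range of cluster counts and look for the "elbow" where adding clusters stops helping, which is the calculation behind clustering-optimization plots such as those in the following figures. The sketch below shows it on synthetic data; the function name `wss_curve` and its defaults are assumptions for illustration.

```r
# Sketch: within-groups sum of squares for k = 1..k_max (k_max >= 2 assumed).
wss_curve <- function(x, k_max = 10, nstart = 20) {
  x <- as.matrix(x)
  wss <- numeric(k_max)
  # k = 1: total sum of squares about the column means
  wss[1] <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  for (k in 2:k_max) {
    wss[k] <- kmeans(x, centers = k, nstart = nstart, iter.max = 100)$tot.withinss
  }
  wss
}

set.seed(3)
# Synthetic data with three compact groups (illustrative only)
x <- rbind(matrix(rnorm(200, 0, 0.5), ncol = 2),
           matrix(rnorm(200, 4, 0.5), ncol = 2),
           matrix(rnorm(200, 8, 0.5), ncol = 2))
wss <- wss_curve(x, k_max = 8)
print(round(wss, 1))
```

Plotting `wss` against `1:8` with `plot(..., type = "b")` gives the elbow chart; for this synthetic example the curve should flatten after three clusters.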
Figure 8 : Clustering Optimization for Moore well
Figure 9: Clustering optimization of Deaton Well
Figure 10 : Clustering Optimization for Frank mask well
Figure 11: Clustering optimization of Kinyon Well
Figure 12 : Clustering optimization of Hagler Well
PART-3: Clustering using Merged Dataset of all wells
This time I used only the wells that contain all five curves, i.e. GR, AT90, PEF, NPHI, and RHOB.
The following wells were selected for the analysis:
Bonds ranch C-1
Hyder 1H
Jerome Russell
John W Porter 3
Massey Unit
McFarland-Dixon
Moore-Price
Sol Carpenter Heirs
Sugar tree
Upham Joe Johnson
Applying the same workflow to the merged dataset gives the three clusters shown in Figure 13.
Figure 13: PCA clusters for merged dataset
The table below gives the centroid of each cluster:
Cluster number   PC1      PC2      Avg. GR (API)   Avg. DPHI   Avg. PEF   Avg. AT90   Avg. RHOB   Avg. NPHI
2                -1.455   0.08     154             0.124       3.13       152         2.49        0.177
3                1.5113   0.8375   137             0.038       3.19       16.26       2.64        0.191
1                1.2253   -1.647   134             0.09        3.33       9.12        2.55        0.289
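Per-cluster averages of the original curves, like those in the table above, can be obtained by clustering on the principal component scores and then averaging the untransformed variables by cluster label. The sketch below shows the idea with base R's `prcomp`, `kmeans`, and `aggregate` on synthetic values; the data frame `logs` is made up for illustration and does not reproduce the merged well data.

```r
set.seed(11)
# Synthetic stand-in for the filtered, merged log table (not real well data)
logs <- data.frame(
  GR   = c(rnorm(50, 150, 5),     rnorm(50, 135, 5),     rnorm(50, 120, 5)),
  RHOB = c(rnorm(50, 2.49, 0.01), rnorm(50, 2.64, 0.01), rnorm(50, 2.55, 0.01)),
  NPHI = c(rnorm(50, 0.18, 0.01), rnorm(50, 0.19, 0.01), rnorm(50, 0.29, 0.01))
)

pca <- prcomp(logs, center = TRUE, scale. = TRUE)       # cluster in PC space
fit <- kmeans(pca$x[, 1:2], centers = 3, nstart = 25)

# Map the clusters back to the original variables: mean of each curve per cluster
cluster_means <- aggregate(logs, by = list(cluster = fit$cluster), FUN = mean)
print(cluster_means)
```

Reading the cluster averages in physical units, rather than as PC1/PC2 coordinates, is what makes an interpretation in terms of porosity, resistivity, and density possible.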
Conclusion
The clusters can be interpreted as follows:
Cluster name   Interpretation
1              Shales/sands with low porosity (0.09) and low resistivity (9.12). Probably tight shales with high clay-bound water (since NPHI is high, 0.289).
3              Shales/sands with very low porosity (0.038) but higher resistivity (16.26) and grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
2              Probably the hottest spot in this region, with good porosity and high hydrocarbon saturation. The well with the largest amount of Facies 2 should therefore be the most prolific producer.
Appendix – A: R Code for Part-III
library(cluster)  # provides clusplot()

setwd("C:/Users/priya/Desktop/DMP_midterm/R")
ms <- read.table("Book1_final.csv", header = TRUE, sep = ",")
ms[is.na(ms)] <- 0   # replace missing values with 0
str(ms)              # inspect the columns
ms <- ms[, c(1, 2, 4, 5, 6, 7, 8)]

# keep only rows with PEF < 4 and GR > 110
msfilter <- ms[(ms$PEF < 4 & ms$GR > 110), ]

par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))

# apply PCA to the centered and scaled variables
mspca <- prcomp(msfilter, center = TRUE, scale. = TRUE, retx = TRUE)
fulldata <- data.frame(msfilter, mspca$x)
mydata <- mspca$x

# determine the number of clusters from the within-groups sum of squares
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
dev.copy(pdf, "myplot.pdf")  # save the current plot to a PDF file
dev.off()

# K-means clustering with three clusters
fit <- kmeans(mydata, 3, iter.max = 100, nstart = 50)
# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)
# append cluster assignment
mydata <- data.frame(fulldata, fit$cluster)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)
write.table(mydata, "C:/Users/priya/Desktop/DMP_midterm/R/mergeddata.txt", sep = "\t")