
DATA MINING APPLICATIONS

Introduction

Data mining is widely used in diverse areas. A number of commercial data mining systems are available today, yet many challenges remain in this field. In this tutorial we discuss the applications and trends of data mining.

Data Mining Applications

Here is the list of areas where data mining is widely used:

Financial Data Analysis

Retail Industry

Telecommunication Industry

Biological Data Analysis

Other Scientific Applications

Intrusion Detection

Financial Data Analysis

The financial data in the banking and financial industry are generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases:

Design and construction of data warehouses for multidimensional data analysis and data mining.

Loan payment prediction and customer credit policy analysis.

Classification and clustering of customers for targeted marketing.

Detection of money laundering and other financial crimes.

Retail Industry

Data mining has great application in the retail industry, because retailers collect large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability and popularity of the web.

Data mining in the retail industry helps identify customer buying patterns and trends. That leads to improved quality of customer service and better customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:

Design and Construction of data warehouses based on benefits of data mining.

Multidimensional analysis of sales, customers, products, time and region.

Analysis of effectiveness of sales campaigns.

Customer Retention.

Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for understanding the business.

Data mining in the telecommunication industry helps identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples in which data mining improves telecommunication services:

Multidimensional Analysis of Telecommunication data.

Fraudulent pattern analysis.

Identification of unusual patterns.

Multidimensional association and sequential patterns analysis.

Mobile Telecommunication services.

Use of visualization tools in telecommunication data analysis.

Biological Data Analysis

Nowadays we see vast growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are aspects in which data mining contributes to biological data analysis:

Semantic integration of heterogeneous, distributed genomic and proteomic databases.

Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.

Discovery of structural patterns and analysis of genetic networks and protein pathways.

Association and path analysis.

Visualization tools in genetic data analysis.

Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences and astronomy. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering and fluid dynamics. The following are applications of data mining in scientific fields:

Data Warehouses and data preprocessing.

Graph-based mining.

Visualization and domain specific knowledge.

Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. Increased usage of the Internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection:

Development of data mining algorithm for intrusion detection.

Association and correlation analysis, aggregation to help select and build discriminating attributes.

Analysis of Stream data.

Data Objects

Data sets are made up of data objects. A data object represents an entity. Examples:

o sales database: customers, store items, sales
o medical database: patients, treatments
o university database: students, professors, courses

Data objects are also called samples, examples, instances, data points, objects, or tuples. Data objects are described by attributes. Database rows -> data objects; columns -> attributes.

Attributes

Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.

E.g., customer_ID, name, address

Types:

Nominal
Binary
Ordinal
Numeric: quantitative
o Interval-scaled
o Ratio-scaled

Attribute Types

Nominal: categories, states, or “names of things”
o Hair_color = {auburn, black, blond, brown, grey, red, white}
o marital status, occupation, ID numbers, zip codes

Binary: nominal attribute with only 2 states (0 and 1)
o Symmetric binary: both outcomes equally important, e.g., gender
o Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
o Convention: assign 1 to the most important outcome (e.g., HIV positive)

Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
o Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types

Quantity (integer or real-valued)

Interval
o Measured on a scale of equal-sized units
o Values have order
o E.g., temperature in °C or °F, calendar dates
o No true zero-point

Ratio
o Inherent zero-point
o We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
o E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes

Discrete attribute
o Has only a finite or countably infinite set of values
o E.g., zip codes, profession, or the set of words in a collection of documents
o Sometimes represented as integer variables
o Note: binary attributes are a special case of discrete attributes

Continuous attribute
o Has real numbers as attribute values
o E.g., temperature, height, or weight
o Practically, real values can only be measured and represented using a finite number of digits
o Continuous attributes are typically represented as floating-point variables

Chapter 3. Summarizing Data: Statistical descriptions

Besides being the subject matter of the course, the word statistics is also the plural of the word statistic, which is a quantity computed from sample data, while a parameter is a quantity computed from the data of the whole population. A statistical description is a synonym of statistic.

Statistical descriptions are usually classified by the features of the sample data that they are trying to describe. The most common ones are measures of location or center, which are indicative of the 'center' of the data, while the measures of variation are indicative of the variability of the data.

Measures of Location: The mean

Sample mean

The sample mean is obtained by adding all the values in your sample and dividing by the sample size (which is usually denoted by a small n). In mathematical notation, we have

(1)

$$\bar{x} = \frac{\sum x}{n}$$

Notice that the symbol for the sample mean is $\bar{x}$, an x with a bar above, and it is read "x bar". The symbol $\sum$ is the sum sign in mathematics, which means that you should add up all the values in your sample.

Say, for example that you collected the age of 4 students in the class to estimate the average age of the whole class. Then the 4 students are the sample, and the whole class the population.

The sample size is n=4.

If the ages you collected are 22, 21, 48, and 21, then the sample values are written as

(2)

x1=22, x2=21, x3=48, x4=21.

and for this particular example, the mean is

(3)

$$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{4} = \frac{x_1 + x_2 + x_3 + x_4}{4} = \frac{22 + 21 + 48 + 21}{4} = \frac{112}{4} = 28$$

Population mean

If we were to calculate the mean from a population instead of a sample, we would proceed in the same way: we would add up all the values in the population (more values to add) and divide by the population size (denoted by N).

The symbol for the population mean is the Greek letter mu, μ, so we obtain

(4)

$$\mu = \frac{\sum x}{N}$$

In the case of the population of students in the classroom with the following ages: 21 21 25 20 26 22 46 25 58 24 20 20 25 23 27 21 23 22 28, the population mean is 26.2.

In this case, it is understood that the ∑ (sigma) sign indicates the sum over all the elements of the population (not only the sample). The sum is therefore $x_1 + x_2 + x_3 + \dots + x_N$, where N=19 for the ages listed above, instead of $x_1 + x_2 + \dots + x_n$ as in the sample mean, where n=4 in our example above. When we want to make this difference more evident, we write $\sum_{i=1}^{N}$ for the population and $\sum_{i=1}^{n}$ for the sample.

Properties of the mean

The mean as a measure of center is probably the most frequently used measure of center, partly because it has the following properties:

1. it can be calculated for any numerical data set
2. its value is unambiguous and unique for a given data set
3. it lends itself to further statistical treatment
4. if each value in the sample were replaced by the mean, ∑x would remain unchanged
5. it takes into account every value in the data set
6. it is relatively reliable (does not fluctuate widely when selecting different samples)

However, outliers have a strong effect on the mean, particularly if the sample size is small. Therefore, at times the trimmed mean is used instead, in which case the upper and lower 5% is deleted, and the mean is taken without those values.

In our population above, we would trim the largest value, 58, and one of the smallest values, 20, then add the remaining 17 values and divide by 17, to obtain approximately 24.6.
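The mean and the trimmed mean can be checked with a few lines of Python. This is a minimal sketch: the helper name `trimmed_mean` and its rule of dropping the `k` smallest and `k` largest values (one value from each end here, roughly 5% of 19 values) are assumptions made for illustration.

```python
from statistics import mean

# Ages from the examples above.
sample = [22, 21, 48, 21]
population = [21, 21, 25, 20, 26, 22, 46, 25, 58, 24,
              20, 20, 25, 23, 27, 21, 23, 22, 28]


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest values (assumed rule)."""
    if 2 * k >= len(values):
        raise ValueError("k would remove every value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


print(mean(sample))                         # 28
print(round(mean(population), 1))           # 26.2
print(round(trimmed_mean(population), 2))   # mean of the 17 remaining values
```

The outliers 46 and 58 pull the ordinary mean upward; removing only the single largest and smallest values already moves the result closer to the bulk of the ages.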

Measures of Location: The weighted mean

When taking averages, it is often important to take into account that data have different weights; some data points are more "important" than others. For example, if we look at household income per state, we see that Alaska has the 4th highest median household income (64,333), while New York has the 19th highest (53,514). If we want to calculate the national average, however, New York's value counts much more than Alaska's, because New York's population is much larger than Alaska's (19,490,297 vs. 686,293).

Therefore, to get the correct mean, we would have to weight each state's household income with the appropriate weight, which is a measure of relative importance (in this case the population size).

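The weighted mean is given by

(5)

$$\bar{x}_w = \frac{\sum w \cdot x}{\sum w} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$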

For example, suppose that we have a sample with the 2 states named above and Texas ($47,548) and Mississippi ($36,338), with corresponding populations 24,326,974 and 2,938,618 respectively.

Then, the weighted mean is

(6)

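$$\bar{x}_w = \frac{19{,}490{,}297 \cdot 53{,}514 + 686{,}293 \cdot 64{,}333 + 24{,}326{,}974 \cdot 47{,}548 + 2{,}938{,}618 \cdot 36{,}338}{19{,}490{,}297 + 686{,}293 + 24{,}326{,}974 + 2{,}938{,}618}$$

which gives a weighted average of $49,547. The national median household income is $50,740.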
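The weighted mean of the four states can be reproduced as follows. This is a small sketch; the function name `weighted_mean` and the dictionary layout are illustrative choices, while the incomes and populations are the figures quoted above.

```python
# (median household income in USD, population) for each state in the example
states = {
    "New York":    (53_514, 19_490_297),
    "Alaska":      (64_333, 686_293),
    "Texas":       (47_548, 24_326_974),
    "Mississippi": (36_338, 2_938_618),
}


def weighted_mean(pairs):
    """Return sum(w * x) / sum(w) for an iterable of (x, w) pairs."""
    pairs = list(pairs)
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(x * w for x, w in pairs) / total_weight


income = weighted_mean(states.values())
print(round(income))  # 49547
```

With equal weights the function reduces to the ordinary mean, which is a quick sanity check on the implementation.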

Grand mean

http://en.wikipedia.org/wiki/Household_income_in_the_United_States#Income_by_state

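A special case of the formula for the weighted average is the grand mean, which is the overall mean of k groups combined, where $n_i$ and $\bar{x}_i$ are the size and the mean of the i-th group. In that case, it has a special notation and a slightly different formula:

(7)

$$\bar{\bar{x}} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k}$$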

Measures of Location: The Median and other fractiles

The median is a measure of center which, unlike the mean, is not strongly affected by outliers. The symbol used for the sample median is $\tilde{x}$, and for the population median $\tilde{\mu}$.

To obtain the median, we first need to sort the data in ascending order and then find the middle value, namely,

o when n is odd, the median is the value in the middle (after sorting)
o when n is even, it is the mean of the two items nearest to the middle

For example, suppose that we select a sample of size 4 from the population of students in the class listed above, obtaining the following ages: 20, 26, 22, 46.

Then, since n=4, the median is the average of the two values in the middle after sorting, hence the average of 22 and 26, or $\tilde{x}=24$. If we added another value to the sample, say 28, then the sorted sample would be 20, 22, 26, 28, and 46, and hence the median would be $\tilde{x}=26$.
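The odd/even rule translates directly into code. The sketch below uses a hypothetical helper, `median_by_hand`, and compares it with `statistics.median` from the Python standard library.

```python
from statistics import median


def median_by_hand(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


sample = [20, 26, 22, 46]
print(median_by_hand(sample))          # 24.0
print(median_by_hand(sample + [28]))   # 26
print(median(sample))                  # 24.0, same rule in the standard library
```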

When dealing with a small population, or larger sample, it is sometimes convenient to make a stem-and-leaf plot to find the median. The reason for this is that the stem-and-leaf plot sorts the data.

For example, in the case of the GDP per capita of the 20 countries listed in chapter 2, we got the following stem and leaf plot:

Since n=20, we need to find the average between the values in the 10th and 11th position, namely 45 and 46 thousand USD. Therefore, the median GDP per capita, for the 20 richest countries (per capita) is 45,500 USD.

[Figure: stem-and-leaf plot of GDP per capita for the 20 countries]

Fractiles

The median is one example of a fractile, a value that divides the data set into two or more equal parts. Another example is the quartiles, which divide the data into 4 equal parts at the values Q1, Q2, and Q3. (Q4 would be the maximum value, so there is no need to calculate it.) Namely, Q1 is the value such that 25% of the data fall below it, Q3 is such that 75% of the data fall below it, and Q2 is the median, so 50% of the data values fall below it.

In the list above, there are 5 countries whose GDP per capita is 39,000 and the next one has a GDP per capita of 40,000; hence, Q1=39,500. Similarly, Q2 = 45,500 = $\tilde{x}$ and Q3 = 55,500.

Boxplot

A box plot is a graph that summarizes the data by representing 5 values: the minimum, the maximum, and the quartiles Q1, Q2 and Q3.

In the graph below, the 3 central horizontal lines represent Q1, Q2 and Q3, while the point at the extreme represents an outlier. This is a variation of the box plot described above, since the other two horizontal lines are not the minimum and maximum; they are the 5th and 95th percentiles.

[Figure: box plot with lines at the 5th and 95th percentiles and one outlier]

Measures of Location: The mode

The mode is a measure of center that is usually used for non-numerical data. Namely, the mode is the value that occurs most frequently, and it is the only measure of center that can be computed for qualitative data.

For example, in example 3.17, they list the sizes of dresses sold by a store as 10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 14, 11, 9 and 20. In this case, the values 9 and 14 appear most frequently, each occurring exactly 5 times. Therefore, you would say that this data is bimodal (has two modes), namely 9 and 14.
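Counting occurrences by hand is error-prone, so here is a short sketch of the dress-size example using `collections.Counter` and `statistics.multimode` (available in Python 3.8 and later). The variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

sizes = [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 14, 11, 9, 20]

counts = Counter(sizes)
print(counts[9], counts[14])   # 5 5

# multimode returns every value tied for the highest count;
# sorting makes the output independent of the order in the data.
modes = sorted(multimode(sizes))
print(modes)                   # [9, 14]
```

The same call works for qualitative data, for example a list of colour names, where the mean and median are not defined.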

Measures of Variation: The range

The range is a measure of variation or variability of the data. The range of a data set is the largest value minus the smallest value.

For example, for the age distribution in the class, which we write here again for convenience:

21 21 25 20 26 22 46 25 58 24 20 20 25 23 27 21 23 22 28,

the range is 58-20 = 38 years.

One disadvantage of the range is that it is highly influenced by outliers. For that reason, we use the following measure of variation more often.

Measures of Variation: The standard deviation

The standard deviation is the most commonly used measure of variation.

To calculate the standard deviation of a population, one first takes the difference between each data point and the mean (the deviation) and squares that difference (to ensure it is non-negative). Then all those squared differences are added together and divided by N, the size of the population. Finally, the square root is taken. This is summarized in the following formula:

Population standard deviation

(8)

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

The square of this quantity is called the population variance, and even though it is not a measure of variation per se, it is a notion that we will use widely during the course.

Population variance

(9)

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

Population example

Suppose that we had 20 students in the classroom with the following ages:

21 21 25 20 26 22 46 25 58 24 20 20 25 23 27 21 23 22 28 23.

The mean μ=26 in that case.

Then, one way to calculate the standard deviation of the population would be to build the following table:

x      x − μ            (x − μ)²
21     21 − 26 = −5     25
21     21 − 26 = −5     25
25     25 − 26 = −1     1
20     20 − 26 = −6     36
...    (and so on, until the last two rows)
28     28 − 26 = 2      4
23     23 − 26 = −3     9
sum                     1678

Therefore, the variance of the population is σ²=1678/20=83.9, and the standard deviation is the square root of this result, namely σ=9.16.

Sample standard deviation

The sample standard deviation is calculated in almost the same way as the population standard deviation, but substituting the sample mean $\bar{x}$ for the population mean μ, and the sample size minus 1, n−1, for the population size N.

(10)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

The square of this quantity is called the sample variance.

Sample variance

(11)

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Sample example

Suppose that we take a sample of 4 students out of the population listed above, say the ones with ages 27 22 46 25. The sample mean is $\bar{x}$ = 120/4 = 30.

Then the sample standard deviation can be calculated through the table:

x      x − x̄            (x − x̄)²
27     27 − 30 = −3     9
22     22 − 30 = −8     64
46     46 − 30 = 16     256
25     25 − 30 = −5     25
sum                     354

This total we would have to divide by n−1=3 to obtain the sample variance s²=118. The sample standard deviation is then s=10.86.

"Fast" formula

An alternative but equivalent formula for calculating the sample standard deviation is

(12)

$$s = \sqrt{\frac{S_{xx}}{n - 1}}$$

where

(13)

$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}.$$

The advantage of this formula is that you need one less column to calculate the sample standard deviation, which could lead to faster calculations, especially when a lot of data is involved. For the example above, we obtain:

x      x²
27     729
22     484
46     2116
25     625

∑x = 120

∑x² = 3954

Therefore,

Sxx = 3954 − (120)²/4 = 3954 − 3600 = 354

Dividing this by n−1=3, we obtain 118. This is again the value of the sample variance, and therefore we obtain the same value for the sample standard deviation, namely, s=10.86.
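The population and sample calculations, and the equivalence of the "fast" formula, can be verified with Python's `statistics` module. This is a sketch: the helper name `s_xx` is illustrative, and the sample uses the four ages that appear in the tables (27, 22, 46, 25).

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

population = [21, 21, 25, 20, 26, 22, 46, 25, 58, 24,
              20, 20, 25, 23, 27, 21, 23, 22, 28, 23]
sample = [27, 22, 46, 25]

# Population formulas divide by N.
print(round(pvariance(population), 1))   # 83.9
print(round(pstdev(population), 2))      # 9.16

# Sample formulas divide by n - 1.
print(variance(sample))                  # 118
print(round(stdev(sample), 2))           # 10.86


def s_xx(values):
    """Sum of squares minus (sum)^2 / n, as in the 'fast' formula."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n


print(s_xx(sample))                                        # 354.0
print(round(sqrt(s_xx(sample) / (len(sample) - 1)), 2))    # 10.86
```

One practical note: with floating-point data whose mean is large relative to its spread, the one-pass "fast" formula can lose precision, which is why library routines usually work from the deviations instead.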

Application of the standard deviation

Chebyshev's theorem

For any set of data and any constant k greater than one, at least 1 − 1/k² of the data must lie within k standard deviations on either side of the mean.

This theorem gives us a broad lower bound on how much data must lie within the mean plus or minus k standard deviations.

For example, when k=3, it is telling us that at least

1 − 1/3² = 1 − 1/9 = 8/9, or approximately 89%,

of the data must lie within 3 standard deviations from the mean.

We will see in this course that many data sets follow a normal distribution, which is a bell-shaped distribution like the one in the graph below:

For the normal distribution, Chebyshev's theorem applies, but there is actually a more precise empirical rule:

1. About 68% of all the data values lie within 1 standard deviation of the mean
2. About 95% of all the data values lie within 2 standard deviations of the mean
3. About 99.7% of all the data values lie within 3 standard deviations of the mean

For example, if the height of women in the US is normally distributed with a mean of 64 inches and a standard deviation of 2.5 inches, this implies that about 95% of all women in the US are between 59 and 69 inches tall (64 ± 2 × 2.5).
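To see how much sharper the empirical rule is than Chebyshev's bound, the two can be computed side by side. This sketch uses `statistics.NormalDist` from the standard library; the function names `chebyshev_bound` and `normal_coverage` are illustrative.

```python
from statistics import NormalDist


def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), round(normal_coverage(k), 3))
# 2 0.75 0.954
# 3 0.889 0.997

# Heights example: mean 64 inches, standard deviation 2.5 inches.
heights = NormalDist(mu=64, sigma=2.5)
print(round(heights.cdf(69) - heights.cdf(59), 3))   # 0.954
```

Chebyshev's bound holds for any distribution, which is why it is so much weaker than the figures for the normal case.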

z-scores

The z-score is a measure of relative value that is useful for comparing values from different data sets. It is calculated using the following formula in the case of a population:

(14)

$$z = \frac{x - \mu}{\sigma}$$

and in the case of a sample by

(15)

$$z = \frac{x - \bar{x}}{s}.$$

For example, Rebecca Lobo, a female basketball player, is 76 inches tall (http://www.wnba.com/playerfile/rebecca_lobo/index.html); therefore her z-score is

(16)

$$z = \frac{76 - 64}{2.5} = 4.8,$$

using the values for the population mean and standard deviation given above.

In other words, she is 4.8 standard deviations higher than the average US woman. That is extraordinarily tall!
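A z-score is a one-line computation. The sketch below wraps it in a hypothetical helper, `z_score`, that accepts either the population values (μ, σ) or the sample values (x̄, s) as its center and spread.

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the center."""
    if spread <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - center) / spread


# Height example: population mean 64 inches, standard deviation 2.5 inches.
print(z_score(76, 64, 2.5))   # 4.8
print(z_score(59, 64, 2.5))   # -2.0
```

Because the result is unit-free, z-scores from different data sets (heights, exam scores, incomes) can be compared directly.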

Coefficient of variation

A measure that allows us to compare variation from different data sets is the coefficient of variation, given by

$$V = \frac{s}{\bar{x}} \cdot 100\% \quad \text{or} \quad V = \frac{\sigma}{\mu} \cdot 100\%$$

For example, if one statistics class averaged 75 with a standard deviation of 10 points and another one averaged 65 with a standard deviation of 8 points, the coefficient of variation would allow us to find which class has less variability (is more homogeneous).

The coefficients of variation are $\frac{10}{75} \cdot 100\% = 13.3\%$ and $\frac{8}{65} \cdot 100\% = 12.3\%$, so the second class is more homogeneous (or consistent).
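The comparison of the two classes can be scripted as follows; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    if mean_value == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean_value * 100


first_class = coefficient_of_variation(10, 75)
second_class = coefficient_of_variation(8, 65)
print(round(first_class, 1))    # 13.3
print(round(second_class, 1))   # 12.3
print("second class is more homogeneous:", second_class < first_class)
```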

Grouped data

When we create or receive a frequency distribution, the data has been grouped already, and therefore we have lost some of the original information.

For example, in the last section we saw the frequency distribution of GDP per capita of the 20 richest countries:

GDP range        Number of countries
80,000-89,999    1
70,000-79,999    2
60,000-69,999    4
50,000-59,999    0
40,000-49,999    8
30,000-39,999    5
Total            20

Even though we have lost some information, we can still calculate an approximate mean and standard deviation. Namely, let $f_1, f_2, \dots, f_k$ be the class frequencies, and let $x_1, x_2, \dots, x_k$ be the midpoints of the classes; then we can approximate the mean by

$$\bar{x} = \frac{\sum x f}{n}$$

and the standard deviation using the "fast formula" by

$$s = \sqrt{\frac{S_{xx}}{n - 1}} \quad \text{where} \quad S_{xx} = \sum x^2 f - \frac{\left(\sum x f\right)^2}{n}$$

Extending the table above, we obtain

GDP range        f     x      xf      x²f
80,000-89,999    1     85K    85      7225
70,000-79,999    2     75K    150     11250
60,000-69,999    4     65K    260     16900
50,000-59,999    0     55K    0       0
40,000-49,999    8     45K    360     16200
30,000-39,999    5     35K    175     6125
Total            20           1030    57700

Therefore,

$$\bar{x} = \frac{1030}{20} = 51.5\text{K}, \qquad S_{xx} = 57{,}700 - \frac{1030^2}{20} = 4655$$

Therefore the variance from grouped data is 4655/19=245 and the standard deviation is approximately 15.65 (both with x measured in thousands of USD).
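The grouped-data calculation can be reproduced from the class midpoints and frequencies alone. This is a minimal sketch with midpoints in thousands of USD; the variable names are illustrative.

```python
from math import sqrt

# (class midpoint in thousands of USD, frequency) for each GDP class
classes = [(85, 1), (75, 2), (65, 4), (55, 0), (45, 8), (35, 5)]

n = sum(f for _, f in classes)
sum_xf = sum(x * f for x, f in classes)
sum_x2f = sum(x * x * f for x, f in classes)

grouped_mean = sum_xf / n
s_xx = sum_x2f - sum_xf ** 2 / n
grouped_variance = s_xx / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(n, sum_xf, sum_x2f)      # 20 1030 57700
print(grouped_mean)            # 51.5
print(s_xx)                    # 4655.0
print(grouped_variance)        # 245.0
print(round(grouped_sd, 2))    # 15.65
```

These are approximations: every country in a class is treated as if its GDP equalled the class midpoint.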

Skewness

When the data in a distribution are not spread evenly around the center, we say that the distribution is skewed. If the bulk of the data lies to the left of center and the longer tail extends to the right, it is skewed to the right (positive skew); if the bulk lies to the right of center and the longer tail extends to the left, it is skewed to the left (negative skew), as can be seen in the image below.

If the data is not skewed either way, then we call it symmetric, like in the case of the normal distribution depicted above.

Data preprocessing

Why preprocessing?

1. Real world data are generally
o Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
o Noisy: containing errors or outliers
o Inconsistent: containing discrepancies in codes or names

2. Tasks in data preprocessing
o Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
o Data integration: using multiple databases, data cubes, or files.
o Data transformation: normalization and aggregation.
o Data reduction: reducing the volume but producing the same or similar analytical results.
o Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Data cleaning

1. Fill in missing values (attribute or class value):
o Ignore the tuple: usually done when the class label is missing.
o Use the attribute mean (or majority nominal value) to fill in the missing value.
o Use the attribute mean (or majority nominal value) for all samples belonging to the same class.
o Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.

2. Identify outliers and smooth out noisy data:
o Binning
   - Sort the attribute values and partition them into bins (see "Unsupervised discretization" below);
   - Then smooth by bin means, bin median, or bin boundaries.
o Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
o Regression: smooth by fitting the data into regression functions.

3. Correct inconsistent data: use domain knowledge or expert decision.
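Two of the cleaning steps above, filling missing values with the attribute mean and smoothing by bin means, can be sketched in plain Python. The function names, the use of `None` to mark a missing value, and both data lists are assumptions made for illustration; the bins here are equal-frequency bins.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    fill = mean(known)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)  # first bins absorb the remainder
        bin_values = ordered[start:end]
        if bin_values:
            smoothed.extend([mean(bin_values)] * len(bin_values))
        start = end
    return smoothed


ages = [21, None, 25, 20, 26, None, 46]        # hypothetical attribute with gaps
print(fill_with_mean(ages))                    # [21, 27.6, 25, 20, 26, 27.6, 46]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]    # hypothetical noisy attribute
print(smooth_by_bin_means(prices, 3))          # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Filling with the class-conditional mean works the same way, except that `known` is restricted to the tuples that share the class label of the tuple being filled.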

Data transformation

1. Normalization:
o Scaling attribute values to fall within a specified range.
   - Example: to transform V in [Min, Max] to V' in [0,1], apply V'=(V-Min)/(Max-Min)
o Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V'=(V-Mean)/StDev

2. Aggregation: moving up in the concept hierarchy on numeric attributes.

3. Generalization: moving up in the concept hierarchy on nominal attributes.

4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
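Both normalization formulas are easy to apply to a list of values. In this sketch the function names and the sample ages are hypothetical, and the z-score version uses the population standard deviation (`pstdev`), which is an assumption since the text does not say which standard deviation is meant.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; the range is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score_normalize(values):
    """Rescale values to mean 0 and standard deviation 1: V' = (V - Mean) / StDev."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        raise ValueError("all values are equal; the standard deviation is zero")
    return [(v - center) / spread for v in values]


ages = [20, 25, 30, 58]   # hypothetical attribute values
print([round(v, 3) for v in min_max(ages)])
print([round(v, 3) for v in z_score_normalize(ages)])
```

Min-max scaling is sensitive to outliers such as 58, which compresses the other values toward 0, whereas the z-score version only requires the mean and standard deviation.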

Data reduction

1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
o Principal component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data.

2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization

3. Reducing the number of tuples
o Sampling
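Binning and sampling, two of the reduction techniques listed above, can be sketched as follows. The helper names are illustrative, the rule for assigning ranks to equal-frequency bins is one of several reasonable choices, and the seed passed to `random.Random` is arbitrary.

```python
import random


def equal_width_bins(values, n_bins):
    """Bin index (0..n_bins-1) of each value, using intervals of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Bin index of each value, so that bins hold (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [21, 21, 25, 20, 26, 22, 46, 25, 58, 24]
print(equal_width_bins(ages, 3))       # [0, 0, 0, 0, 0, 0, 2, 0, 2, 0]
print(equal_frequency_bins(ages, 3))   # [0, 0, 1, 0, 2, 0, 2, 1, 2, 1]

# Reducing the number of tuples: simple random sampling without replacement.
rng = random.Random(0)
subset = rng.sample(ages, 4)
print(len(subset))                     # 4
```

The example shows why equal-width binning copes poorly with outliers: the two large ages occupy the last bin alone and the middle bin stays empty, while equal-frequency binning spreads the values evenly.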

Discretization and generating concept hierarchies

1. Unsupervised discretization - class variable is not used.
o Equal-interval (equiwidth) binning: split the whole range of numbers into intervals of equal size.
o Equal-frequency (equidepth) binning: use intervals containing an equal number of values.

2. Supervised discretization - uses the values of the class variable.
o Using class boundaries. Three steps:
   - Sort values.
   - Place breakpoints between values belonging to different classes.
   - If too many intervals, merge intervals with equal or similar class distributions.
o Entropy (information)-based discretization. Example:
   - Information in a class distribution:
      Denote a set of five values occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-]
      That is, the first 3 belong to "+" tuples and the last 2 to "-" tuples
      Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5)-(2/5)*log(2/5) (logs are base 2)
      3/5 and 2/5 are relative frequencies (probabilities)
      Ignoring the order of the values, we can use the following notation: [3,2], meaning 3 values from one class and 2 from the other.
      Then, Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5)
   - Information in a split (2/5 and 3/5 are weight coefficients):
      Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
      Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
   - Method:
      Sort the values;
      Calculate information in all possible splits;
      Choose the split that minimizes information;
      Do not include breakpoints between values belonging to the same class (this will increase information);
      Apply the same to the resulting intervals until some stopping criterion is satisfied.

3. Generating concept hierarchies: recursively applying partitioning or discretization methods.
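The information calculations in the entropy-based example can be written out in Python. This is a sketch under stated assumptions: the names `info`, `split_info` and `best_split` are illustrative, class distributions are given as lists of counts such as [3, 2], and `best_split` performs only a single level of the recursive method.

```python
from math import log2


def info(counts):
    """Entropy (base 2) of a class distribution given as class counts, e.g. [3, 2]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def split_info(*parts):
    """Weighted average entropy of the intervals produced by a split."""
    total = sum(sum(p) for p in parts)
    return sum(sum(p) / total * info(p) for p in parts)


def best_split(labels):
    """Index and information of the best single breakpoint.

    labels: class labels ordered by the sorted attribute values.
    Breakpoints between equal labels are skipped, as in the method above.
    """
    classes = sorted(set(labels))
    best = None
    for i in range(1, len(labels)):
        if labels[i - 1] == labels[i]:
            continue
        left = [labels[:i].count(c) for c in classes]
        right = [labels[i:].count(c) for c in classes]
        score = split_info(left, right)
        if best is None or score < best[1]:
            best = (i, score)
    return best


print(round(info([3, 2]), 3))                 # 0.971
print(round(split_info([2, 0], [1, 2]), 3))   # 0.551
print(best_split(list("+++--")))              # (3, 0.0)
```

Splitting [+,+,+,-,-] after the third value produces two pure intervals, so the information of that split is 0, lower than the 0.551 obtained by splitting after the second value.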

Data visualization

Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.

Today's data visualization tools go beyond the standard charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever charts. The images may include interactive capabilities, enabling users to manipulate them or drill into the data for querying and analysis. Indicators designed to alert users when data has been updated or predefined conditions occur can also be included.

Importance of data visualization

Data visualization has become the de facto standard for modern business intelligence (BI). The success of the two leading vendors in the BI space, Tableau and Qlik -- both of which heavily emphasize visualization -- has moved other vendors toward a more visual approach in their software. Virtually all BI software has strong data visualization functionality.

Data visualization tools have been important in democratizing data and analytics and making data-driven insights available to workers throughout an organization. They are typically easier to operate than traditional statistical analysis software or earlier versions of BI software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.

Data visualization software also plays an important role in big data and advanced analytics projects. As businesses accumulated massive troves of data during the early years of the big data trend, they needed a way to quickly and easily get an overview of their data. Visualization tools were a natural fit.

Visualization is central to advanced analytics for similar reasons. When a data scientist is writing advanced predictive analytics or machine learning algorithms, it becomes important to visualize the outputs to monitor results and ensure that models are performing as intended. This is because visualizations of complex algorithms are generally easier to interpret than numerical outputs.

Examples of data visualization

Data visualization tools can be used in a variety of ways. The most common use today is as a BI reporting tool. Users can set up visualization tools to generate automatic dashboards that track company performance across key performance indicators and visually interpret the results.

Many business departments implement data visualization software to track their own initiatives. For example, a marketing team might implement the software to monitor the performance of an email campaign, tracking metrics like open rate, click-through rate and conversion rate.


As data visualization vendors extend the functionality of these tools, they are increasingly being used as front ends for more sophisticated big data environments. In this setting, data visualization software helps data engineers and scientists keep track of data sources and do basic exploratory analysis of data sets prior to or after more detailed advanced analyses.

How data visualization works

Most of today's data visualization tools come with connectors to popular data sources, including the most common relational databases, Hadoop and a variety of cloud storage platforms. The visualization software pulls in data from these sources and applies a graphic type to the data.

Data visualization software allows the user to select the best way of presenting the data, but, increasingly, software automates this step. Some tools automatically interpret the shape of the data and detect correlations between certain variables and then place these discoveries into the chart type that the software determines is optimal.

Typically, data visualization software has a dashboard component that allows users to pull multiple visualizations of analyses into a single interface, generally a web portal.

Measures of Similarity and Dissimilarity

Similarity and Dissimilarity

Distance or similarity measures are essential to solve many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how close two distributions are. For multivariate data, more complex summary methods have been developed to answer this question.

Similarity Measure: Numerical measure of how alike two data objects are. Often falls between 0 (no similarity) and 1 (complete similarity).

Dissimilarity Measure: Numerical measure of how different two data objects are. Ranges from 0 (objects are alike) to ∞ (objects are different).

Proximity refers to a similarity or dissimilarity.

Similarity/Dissimilarity for Simple Attributes

Here, p and q are the attribute values for two data objects.

Nominal
Similarity: $s = 1$ if $p = q$, $s = 0$ if $p \neq q$
Dissimilarity: $d = 0$ if $p = q$, $d = 1$ if $p \neq q$

Ordinal (values mapped to integers 0 to n−1, where n is the number of values)
Similarity: $s = 1 - \frac{|p - q|}{n - 1}$
Dissimilarity: $d = \frac{|p - q|}{n - 1}$

Interval or Ratio
Similarity: $s = 1 - |p - q|$ or $s = \frac{1}{1 + |p - q|}$
Dissimilarity: $d = |p - q|$

Common Properties of Dissimilarity Measures

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q, p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r,

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is called a metric.

Following is a list of several common distance measures to compare multivariate data. We will assume that the attributes are all continuous.

Euclidean Distance

Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also called attributes).

The Euclidean distance between the ith and jth objects is

$$d_E(i, j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$$

for every pair (i, j) of observations.

The weighted Euclidean distance is

$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} W_k (x_{ik} - x_{jk})^2 \right)^{1/2}$$

If scales of the attributes differ substantially, standardization is necessary.
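The effect of differing scales can be illustrated with a minimal Python sketch. The function names, the z-score standardization (using the sample standard deviation), and the height/income data are assumptions made for illustration; they do not come from the source.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with one non-negative weight per attribute."""
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))

def standardize(rows):
    """Z-score each column (assumed choice: sample standard deviation, N - 1)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical objects: (height in cm, income in dollars)
data = [[170.0, 30000.0], [180.0, 32000.0], [160.0, 31000.0]]

raw = euclidean(data[0], data[1])   # dominated by the income attribute
z = standardize(data)
scaled = euclidean(z[0], z[1])      # both attributes now contribute comparably

print(raw, scaled)
```

On the raw data the distance is driven almost entirely by income, because its numeric range is far larger than that of height. After standardization both attributes are on a comparable scale.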

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance.

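With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is

$$d_M(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda},$$

where λ ≥ 1. It is also called the $L_\lambda$ metric.

λ = 1: $L_1$ metric, Manhattan or City-block distance.
λ = 2: $L_2$ metric, Euclidean distance.
λ → ∞: $L_\infty$ metric, Supremum distance.

$$\lim_{\lambda \to \infty} \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda} = \max\left( |x_{i1} - x_{j1}|, \ldots, |x_{ip} - x_{jp}| \right)$$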

Note that λ and p are two different parameters. Dimension of the data matrix remains finite.
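As a rough illustration of the three special cases, the following Python sketch computes the Minkowski distance for a given λ. The function name `minkowski` and the two example points are hypothetical; passing `math.inf` is an assumed convention for selecting the supremum distance.

```python
import math

def minkowski(x, y, lam):
    """Minkowski (L_lambda) distance; lam = math.inf gives the supremum distance."""
    if lam < 1:
        raise ValueError("lambda must be >= 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(lam):
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1.0 / lam)

# Hypothetical pair of objects with p = 2 attributes
x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski(x, y, 1)         # |3| + |4|
euclid = minkowski(x, y, 2)            # sqrt(9 + 16)
supremum = minkowski(x, y, math.inf)   # max(|3|, |4|)
large_lam = minkowski(x, y, 50)        # already very close to the supremum

print(manhattan, euclid, supremum, large_lam)
```

Evaluating the function at a large finite λ shows the value approaching the largest absolute coordinate difference, in line with the limit stated above.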

Mahalanobis Distance

Let X be an N × p matrix. Then the ith row of X is

$$x_i^T = (x_{i1}, \ldots, x_{ip}).$$

The Mahalanobis distance is

$$d_{MH}(i, j) = \left( (x_i - x_j)^T \Sigma^{-1} (x_i - x_j) \right)^{1/2},$$

where $\Sigma$ is the p × p sample covariance matrix.
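The following is a minimal pure-Python sketch of the Mahalanobis distance, assuming the sample covariance uses the N − 1 divisor. The helper names (`sample_covariance`, `solve`, `mahalanobis`) and the small data matrix are hypothetical. Instead of forming the inverse explicitly, the sketch solves the linear system Σz = (x_i − x_j) by Gaussian elimination, which gives the same quadratic form.

```python
import math

def sample_covariance(rows):
    """p x p sample covariance matrix of an N x p data matrix (divisor N - 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                  # back substitution
        z[i] = (M[i][n] - sum(M[i][c] * z[c] for c in range(i + 1, n))) / M[i][i]
    return z

def mahalanobis(xi, xj, cov):
    """sqrt((xi - xj)^T cov^{-1} (xi - xj))."""
    diff = [a - b for a, b in zip(xi, xj)]
    z = solve(cov, diff)                          # z = cov^{-1} diff
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))

# Hypothetical 4 x 2 data matrix
X = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0], [7.0, 8.0]]
cov = sample_covariance(X)
d01 = mahalanobis(X[0], X[1], cov)
print(cov, d01)
```

With an identity covariance matrix the function reduces to the ordinary Euclidean distance, which is a convenient sanity check.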

Self-check: Think About It!

Calculate the answers to these questions by yourself.

1. We have $X = \begin{pmatrix} 1 & 3 & 1 & 2 & 4 \\ 1 & 2 & 1 & 2 & 1 \\ 2 & 2 & 2 & 2 & 2 \end{pmatrix}$. Calculate the Euclidean distances. Calculate the Minkowski distances (λ = 1 and λ → ∞ cases).

2. We have $X = \begin{pmatrix} 2 & 3 \\ 10 & 7 \\ 3 & 2 \end{pmatrix}$. Calculate the Minkowski distance (λ = 1, λ = 2, and λ → ∞ cases) between the first and second objects. Calculate the Mahalanobis distance between the first and second objects.

Common Properties of Similarity Measures

Similarities have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,
2. s(p, q) = s(q, p) for all p and q,

where s(p, q) is the similarity between data objects p and q.

Similarity Between Two Binary Variables

The above similarity or distance measures are appropriate for continuous variables. However, for binary variables a different approach is necessary.

Simple Matching and Jaccard Coefficients

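For two binary vectors p and q, let $n_{a,b}$ denote the number of attributes where p takes the value a and q takes the value b.

Simple matching coefficient = $(n_{1,1} + n_{0,0}) / (n_{1,1} + n_{1,0} + n_{0,1} + n_{0,0})$.

Jaccard coefficient = $n_{1,1} / (n_{1,1} + n_{1,0} + n_{0,1})$.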
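The two coefficients can be sketched in Python as follows. The function names and the example vectors are hypothetical, and raising an error when the Jaccard denominator is zero is an assumed convention rather than something the source specifies.

```python
def binary_counts(p, q):
    """Return (n11, n10, n01, n00) for two equal-length 0/1 sequences."""
    n11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return n11, n10, n01, n00

def simple_matching(p, q):
    """Share of attributes on which p and q agree (1-1 and 0-0 matches)."""
    n11, n10, n01, n00 = binary_counts(p, q)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(p, q):
    """Share of 1-1 matches among attributes where at least one object is 1."""
    n11, n10, n01, _ = binary_counts(p, q)
    denom = n11 + n10 + n01
    if denom == 0:
        raise ValueError("Jaccard coefficient is undefined when both vectors are all zeros")
    return n11 / denom

# Hypothetical binary objects
p_example = [1, 1, 0, 0, 1, 0, 0, 0]
q_example = [1, 0, 0, 0, 1, 1, 0, 0]

smc = simple_matching(p_example, q_example)
jac = jaccard(p_example, q_example)
print(binary_counts(p_example, q_example), smc, jac)
```

The Jaccard coefficient ignores 0-0 matches, so it is usually the more informative choice for sparse binary data where shared zeros carry little meaning.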

Self-check: Think About It!

Calculate the answer to the question by yourself.

1. Given data:

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

The frequency table is

	q = 1	q = 0
p = 1	0	1
p = 0	2	7

Calculate the simple matching coefficient and the Jaccard coefficient.


Date post:	23-Mar-2019
Category:	Documents
Upload:	nguyentruc