
Data Warehousing and Data Mining

Dr. G. Raghavendra Rao,Professor and Head,

Dept. of CS&E, NIE, Mysore.

Sri. K. Raghuveer,

Asst. Professor

Dept. of CS&E, NIE, Mysore.


Data Warehousing and Data Mining

Course Introduction

In this course, you will learn about the concepts of data warehousing and data

mining. This is one of the recent additions to the IT area and a lot of work is still going

on. You will be introduced to the present status of the topic and will also be introduced

to the future trends.

The concept of data warehousing is a logical extension of the DBMS concept. In the case of databases, static data is stored and user-generated queries are answered by the system. Data warehousing is a more complete process, wherein apart from getting answers, you can also feed these answers into standard models, like business models, economic models etc., study their performance and modify the parameters suitably. The query answers in a database are an end in themselves, but in a data warehouse, they serve as valuable inputs to the Decision Support System (DSS).

The first unit speaks about the characteristics of a typical data warehouse, data marts, the types of data models, and other details. Most importantly, you will also be introduced to some of the existing software support in the area of data warehousing. At the end of this unit, you should be able to design your own data warehouse, given the requirements.

The second unit is about data mining. The concept is simple. Given a large repository of data, such as a data warehouse, you should be able to search for the desired data with utmost efficiency. This concept is called mining of data. You will be taught some of the algorithms used in this context.

The third unit is a case study of the concepts involved. The case study will help

you to understand the fundamentals in a better way.


UNIT –  I

UNIT INTRODUCTION

In this unit, you are introduced to the basic concepts of a data warehouse.

The first block deals with the fundamental definitions of a data warehouse, how it is an extension of the database concept yet still different from a DBMS, and also depicts the various terminologies used in the context of a data warehouse.

The second block describes, in detail, a typical data warehouse system and the environment in which it operates. The various data models for the data warehouse, the various software tools used and also the two broad classifications of data warehouses - namely the relational and multidimensional warehouses - are discussed. You will also be introduced to some of the software available to a data warehouse designer.

The third block gives a step by step method of developing a typical data warehouse. It begins with the choice of the data to be put in the data warehouse, the concept of metadata, the various hardware and software options and the role of various access tools. It also gives a step by step algorithm for the actual development process of a data warehouse.


BLOCK I

BLOCK INTRODUCTION

In this block, you are briefly introduced to the concept of a data warehouse. It is basically a large storehouse of data, from which users can access and view data in various formats. Further, it allows the users to do computations, so that they can derive maximum benefit out of the data. For example, having data about the previous 5 years' sales is good, but the ability to do computations on it so that you can predict the sales for the next year is much more valuable. Similarly, if by comparing your business performance with that of your competitors you can derive business patterns, it will be excellent. A data warehouse provides exactly such opportunities.

You will also be introduced to the concept of a datamart, which can be thought of as a subsection of a warehouse and which allows you to view only those data which are of interest to you, irrespective of how big the actual database is. You will also be introduced to the basic requirements that the data warehouse or the datamart should satisfy, and to some of the implementation issues involved.

Contents

1.  Data warehousing

2.  Datamart

3.  Types of data warehouses

4.  Loading of data into a mart

5.  Data model for data warehouse

6.  Maintenance of data warehouse

7.  Metadata

8.  Software components

9.  Security of data warehouse

10. Monitoring of a data warehouse


11. Block summary

BLOCK I

1. DATA WAREHOUSING

A Data Warehouse can be thought of on lines similar to any other warehouse - i.e. a place where selected and (sometimes) modified operational data are stored. This data can answer any query - which may be complex, statistical or analytical. A data warehouse is normally the heart of an organization's decision support system (DSS). It is essential for effective business operation, manipulation, strategy planning and historical data recording. With the business scenario becoming more and more competitive and the amount of data to be processed, according to a rough estimate, doubling every 2 years, the need for a very fast, accurate and reliable data warehouse need not be over emphasized. A proper organization and fast retrieval of data are the keys for the effective operation of a data warehouse.

Unfortunately, such rhetoric can explain the situation only up to a certain degree. When one gets down to brass tacks - the actual organization of such a warehouse - the questions arise: are there any simple rules that govern data warehousing operation? Do the rules remain the same for all types of data and all types of analysis? What are the tradeoffs involved? How reliable and effective are such warehouses? This course answers some of these questions. However, the concept of a data warehouse is a relatively new one and is still undergoing a lot of transformation. Hence, in this work you find only pointers - guidelines for the effective operation of such a warehouse. But the actual operation needs a lot more skill than simple knowledge of the ground rules.

Before we venture into the warehouse details, we see what types of data will essentially be handled there. Normally, they are classified into 3 groups.

i.  Transaction and reference data: These are the original data, arising out of the various systems (at regular intervals), and are comparable to the data in a database. They are removed (purged) when their useful lifespan is over. However, the purged data is normally archived onto tape or any such devices.


ii.  Derived data (or secondary data), as the name suggests, are derived from the reference data (normally as a result of certain computations).

iii.  Denormalised data is prepared periodically for online processing, but unlike derived data, it is directly based on the transaction data (not on computations).
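To make the three categories concrete, here is a minimal SQL sketch (not part of the original text); the table and column names (daily_sales, products, stores and so on) are purely hypothetical, and the products and stores lookup tables are assumed to exist already.

-- Transaction/reference data: original data arriving from the operational systems.
CREATE TABLE daily_sales (
    sale_date  DATE,
    product_id INTEGER,
    store_id   INTEGER,
    quantity   INTEGER,
    amount     DECIMAL(12,2)
);

-- Derived (secondary) data: obtained from the reference data by computation.
CREATE TABLE monthly_sales AS
SELECT store_id,
       product_id,
       EXTRACT(YEAR  FROM sale_date) AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount) AS total_amount
FROM   daily_sales
GROUP  BY store_id, product_id,
          EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);

-- Denormalised data: the transaction data flattened with descriptive columns
-- for online use; no computation is involved.
CREATE TABLE sales_flat AS
SELECT s.sale_date, s.quantity, s.amount,
       p.product_name, st.store_name, st.city
FROM   daily_sales s
JOIN   products p  ON p.product_id = s.product_id
JOIN   stores   st ON st.store_id  = s.store_id;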

2. DATA MART

A data warehouse holds the entire data, of which the various departments of the decision support system would need only portions. These portions are drawn into "Data Marts". They can be viewed as subsets of a data warehouse. They have certain advantages over a central data warehouse. The latter keeps growing and becomes complex, unwieldy and difficult to understand and manage after a time. It becomes difficult to customize, maintain and keep track of. Further, as the volume of data increases, the software needed to access the data becomes more complex, and the time complexities also increase. (It can be compared to a very large library, where searching for the book you need becomes that much more difficult.)

On the other hand, a data mart can be thought of as a small departmental library, which is small, elegant and easy to handle and customize. It is easy to sort, search or structure the data elements without any global considerations. Obviously, the hardware and software demands are manageable. Once in a while you may need some data that is not available in the datamart; this can easily be traced back to the data warehouse.

It is to be noted that the type of data that flows from the warehouse to the data mart is of the current level type. The derived data and denormalised data are to be prepared at the data mart level itself.

However, many of the issues that affect the warehouse affect the data marts also. You can view, for simplicity, the data warehouse as a collection of several data marts. Hence, whatever we say about a data mart can be extended to a warehouse and vice versa, unless specified otherwise.

3. TYPES OF DATA WAREHOUSES


Data warehouses are basically of two types - multidimensional and relational. In a multidimensional data warehouse the data is stored so that its multidimensionality is not lost. Contrast this with the RDBMS method of storing, wherein the data is essentially stored as tables and the multidimensionality of the data is lost. Thus, in a multidimensional warehouse, queries can be asked on the multidimensionality of the data. At this stage, we cannot describe the operation of such warehouses, but it suffices to say that specialized search engines are needed to support such models.

On the other hand, relational warehouses contain both text and numeric data and are supported by an RDBMS. They are used for general purpose analysis.

4. LOADING OF DATA INTO A DATA MART

The data mart is loaded from the data warehouse using a load program. The loading program takes care of the following factors before loading the data from the data warehouse to the data mart.

i)  Frequency and schedule, i.e. when and how often the data is to be loaded

ii)  Total or partial refreshment, i.e. whether all data in the data mart is modified (replaced) or only a part of it

iii)  Selection, resequencing and merging of data, when required

iv)  Efficiency (or speed) of loading

v)  Integrity of data, i.e. the data should not get unintentionally modified during the transfer, and the data in the data mart should, at all times, match that in the warehouse.
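As an illustration of points (i) to (v), here is a minimal SQL sketch of a partial refresh. The warehouse and mart schema names (warehouse.sales_fact, mart.sales_fact) and the region column are hypothetical, a real load program would usually be generated or configured in a loading tool, and the transaction syntax varies slightly between database products.

-- Partial refresh of a departmental data mart from the central warehouse.
-- Only one region's rows, and only the current refresh window, are replaced.
-- Running the whole load inside one transaction protects data integrity.
BEGIN;

DELETE FROM mart.sales_fact
WHERE  sale_date >= DATE '2019-07-01';              -- the refresh window

INSERT INTO mart.sales_fact (sale_date, product_id, store_id, amount)
SELECT sale_date, product_id, store_id, amount      -- selection of columns
FROM   warehouse.sales_fact
WHERE  region_id = 42                               -- this department's region only
AND    sale_date >= DATE '2019-07-01'
ORDER  BY sale_date, store_id;                      -- resequencing, where required

COMMIT;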

5. DATA MODEL FOR DATA WAREHOUSE

A model is to be built for the data warehouse when large amounts of data are to be stored. Though the data model for the data warehouse need not correspond to any of the standard RDBMS models, it is desirable to choose one that is similar to a standard one.

6. MAINTENANCE OF A DATA WAREHOUSE


All data warehouses (as also DBMSs, for that matter) need periodic maintenance, i.e. loading, refreshing and purging of data. Loading can be from the data warehouse (in the case of a data mart) or from the system producing the data (in the case of a data warehouse). Refreshing means updating the data, maybe on a daily, weekly or monthly basis. Purging of data means reading the data periodically and weeding out old data. The data to be weeded out may be totally removed, archived or condensed depending on the nature of the purging, but most often it is archived.
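A minimal SQL sketch of a purge with archiving, assuming a hypothetical sales_fact table with a sale_date column and a sales_fact_archive table of the same structure; the date-interval syntax differs slightly from one database product to another.

-- Purging: rows older than the retention period are first archived,
-- then removed from the active warehouse table.
BEGIN;

INSERT INTO sales_fact_archive
SELECT *
FROM   sales_fact
WHERE  sale_date < CURRENT_DATE - INTERVAL '5' YEAR;

DELETE FROM sales_fact
WHERE  sale_date < CURRENT_DATE - INTERVAL '5' YEAR;

COMMIT;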

7. METADATA 

Most data warehouses and data marts come with metadata (data about data). The metadata is a description of the contents and source of the data of the warehouse, the type of customization applied to the data, a description of the data marts of the warehouse, its tables, relationships etc., and any other relevant details about the data warehouse.

The metadata is created and updated (from time to time) and is very useful for the data analyst or the systems manager to manipulate the warehouse.
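As a small, hypothetical illustration of what such metadata might look like when it is itself kept in tables (the table and column names below are invented for this sketch):

-- A minimal metadata table describing the contents and source of each
-- warehouse table; real metadata repositories carry far more detail.
CREATE TABLE warehouse_metadata (
    table_name     VARCHAR(128),
    source_system  VARCHAR(128),    -- where the data originally comes from
    subject_area   VARCHAR(128),    -- e.g. 'sales', 'student results'
    refresh_cycle  VARCHAR(32),     -- 'daily', 'weekly' or 'monthly'
    last_refreshed TIMESTAMP,
    description    VARCHAR(1000)
);

INSERT INTO warehouse_metadata VALUES
('sales_fact', 'billing system', 'sales', 'daily',
 TIMESTAMP '2019-07-24 02:00:00',
 'One row per item sold; amounts are stored in the reporting currency.');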

8. SOFTWARE COMPONENTS  

The software that goes into the data warehouse varies depending on the context and purpose of the warehouse, but normally includes a DBMS, access and creation tools, and management software.

9. SECURITY OF A DATA WAREHOUSE:

The data of the warehouse needs to be protected against physical as well as software interventions. If unauthorized access, modification and deletion are to be prevented, security only at the warehouse level does not suffice; it has to be implemented at the datamart level also, i.e. a person authorized for one data mart may be prevented from approaching some other data mart, or for that matter the authorization may be valid only for portions of a data mart. The warehouse administrator is responsible for implementing these security measures. The normal methods used are i) firewalls, which are software that prevents unauthorized access into the data warehouse/data mart; ii) logon/logoff passwords, which prevent unauthorised login or logout; iii)


application-based security procedures; and iv) encryption and decryption, where the appearance of the data is modified to prevent unauthorized users from accessing it.
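A minimal sketch of mart-level security using standard SQL privileges; the role, schema and view names are hypothetical, and firewalls, passwords and encryption are handled outside the database itself.

-- The analyst role may read the sales mart, but gets no rights on the HR mart.
CREATE ROLE sales_analyst;
CREATE ROLE regional_manager;

GRANT SELECT ON sales_mart.sales_fact      TO sales_analyst;
GRANT SELECT ON sales_mart.store_dimension TO sales_analyst;
-- No grant is issued on any hr_mart table, so that mart stays out of reach.

-- Authority valid only for a portion of a mart: expose a restricted view.
CREATE VIEW sales_mart.south_region_sales AS
SELECT * FROM sales_mart.sales_fact WHERE region_id = 2;

GRANT SELECT ON sales_mart.south_region_sales TO regional_manager;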

10. MONITORING THE REQUIREMENTS OF A DATA WAREHOUSE

The performance and contents of a warehouse need to be monitored closely, usually by the system manager or data administrator. Monitoring is normally done by "data content tracking". It keeps track of the actual contents of the data mart, the accesses made to invalid/obsolete data in the warehouse, the rate and kind of warehouse growth, consistency issues (between the previously present data and the newly acquired data) etc.

While the monitoring of data is often a transparent operation, its success is very important for ensuring the continued usefulness and reliability of the warehouse to the common user.
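A minimal SQL sketch of data content tracking, assuming a hypothetical sales_fact table; real monitoring tools collect much more (access patterns, consistency checks and so on), but the idea of periodic snapshots is the same.

-- Record a periodic snapshot of each table's row count.
CREATE TABLE content_tracking (
    snapshot_date DATE,
    table_name    VARCHAR(128),
    row_count     BIGINT
);

INSERT INTO content_tracking
SELECT CURRENT_DATE, 'sales_fact', COUNT(*) FROM sales_fact;

-- Growth of the table from one snapshot to the next.
SELECT snapshot_date,
       row_count,
       row_count - LAG(row_count) OVER (ORDER BY snapshot_date) AS growth
FROM   content_tracking
WHERE  table_name = 'sales_fact'
ORDER  BY snapshot_date;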

11. SUMMARY

In this block, you were introduced to the basic concepts of a data mart and a data warehouse, the types of data warehouses and how a data warehouse differs from a database - essentially the warehouse is a multidimensional concept whereas a database is a relational one.

You were also introduced to the concept of metadata, which contains data about the data of the warehouse and helps future users as well as developers to deal with the warehouse. You were also introduced to the issues of data warehouse security, consistency of data, and data integrity.

Many of these aspects will be elaborated in the next blocks.

Review Questions

1)  A _________________ holds only relevant portions of data held by a data

warehouse.

2)  A data warehouse normally holds _________________ data, whereas

RDBMS handle relational data.


3)  Replacement of old data values by latest values is called

 _________________

4)  Removal of obsolete data is called _________________

5)  Data about the data available in a datamart is called _________________

6)  Monitoring of data content is called _________________

7)  Obsolete data in an RDBMS is deleted, whereas in a data mart it is usually

 _________________

8)  Data derived after computations on the primary data is called

 _________________

9)  The compatibility between the previously available data and new data is

called _________________

10) 

The main difference between transaction data and derived data lies in

 _________________

Answers

1.  Datamart

2.  Multidimensional

3.  Refreshing

4.  Purging

5.  Metadata

6.  Data content tracking

7.  Archived

8.  Derived data

9.  Consistency

10. Computations


BLOCK - II

A TYPICAL DATA WAREHOUSE SYSTEM

In this block, we introduce you to the fundamentals of an actual data warehouse. It is presumed that you have at least a preliminary knowledge of the concept of a database, and the concepts of the data warehouse are developed in relation to it.

The concept of the star model, which is central to data warehouse development, is introduced.

Most importantly, you will be introduced to some of the existing tools available in the market, like IBI Focus Fusion, Cognos PowerPlay and Pilot Software. While it is unlikely that you will be using any of them in this course, their study will help you to understand the complexities involved, the various tradeoffs available and, most importantly, the concept of step by step development of the warehouse environment.

Contents

1.  A typical data warehouse

2.  A typical data warehousing environment

3.  Data Modeling –  star schema for multidimensional view

4.  Various Data models

5.  Various OLAP tools

6.  Relational OLAP

7.  Managed query environment

8.  Data warehousing products - state of the art

IBI Focus Fusion

Cognos Power Play

Pilot Software

9.  Summary


1. A TYPICAL DATA WAREHOUSE SYSTEM

A data warehouse system should be able to accept queries from the user, analyze them and give the results. Such systems are sometimes called Online Analytical Processing (OLAP) systems. Contrast this with the conventional online systems, which for the purpose of distinction we call Online Transaction Processing (OLTP) systems. An OLTP system most often will be capable of only a few kinds of transactions, like entering data, accessing records etc., whereas an OLAP system - or a warehousing system - should be capable of analyzing, online, a large number of records with varying interrelationships and summarizing the results. The type of data is usually multidimensional in nature, i.e. each data item is linked to several other data items in several directions.

To make this concept clear, consider the following example:

A student record with name, address and marks scored in different subjects in different years forms a two-dimensional record. Now consider the following: using the student's address, say the city name, you can find out more about that city, its tourist spots etc. Using the field of, say, mathematics, you can find how many students had registered for mathematics, what their addresses are etc. In this way, each link takes you around the scenario as a huge, multidimensional space. The system should be able not only to facilitate such traversals, but also to consolidate the results of such traversals and present them in a suitable format - all in real time. One thing you can be sure of is that such systems need the capacity to store and process enormous amounts of data at very high speeds.

Now about the software. It is obvious that any conventional DBMS would, theoretically, be able to do the desired operations, but with the increase in dimensional complexity, there will be a literal explosion of the SQL statements required to build the query - with multiple joins, scans, aggregations and what not - and the operation would need large amounts of space and time.


Thus a typical data warehousing system, at first approximation, can be said to be a

"resource hungry" system, which cannot be handled by any conventional database

system.

Comparison between a Database System and a Data Warehouse System

Data Warehouse:

1.  Can handle both current and historic data, since the current data is appended to the historic data during updates.
2.  Transactions can be very long and complex.
3.  The volume of transactions (number of transactions over a period of time) is low. Also, data are only periodically refreshed.
4.  No concurrent transactions are allowed, i.e. only one query can access the data at a time. Hence, no recovery procedures for failed transactions are needed.
5.  Queries are often predetermined, needing a high level of indexing.

Database:

1.  Can handle only current data. Whenever an update is done, the new data replaces the existing data, so that no historic data is available.
2.  Transactions are short, or at most combinations of such short transactions.
3.  The volume of transactions is very high.
4.  Concurrent transactions are allowed. Since multiple users can simultaneously access/update the data, the transactions may lead to erroneous results. Hence transaction recovery procedures need to be followed.
5.  Transactions need a low level of indexing and hence can be large and online.
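The contrast can be sketched with two hypothetical queries (the table names are invented for this illustration): an OLTP request touches one current record, while an OLAP request scans historic data and summarises it across several dimensions.

-- A typical OLTP transaction: short, indexed lookup of a single current record.
SELECT balance
FROM   accounts
WHERE  account_id = 100234;

-- A typical OLAP query: long, scans historic facts and summarises them
-- along several dimensions.
SELECT d.fiscal_year,
       s.region,
       p.product_group,
       SUM(f.amount) AS total_sales,
       AVG(f.amount) AS average_sale
FROM   sales_fact f
JOIN   date_dimension    d ON d.date_key    = f.date_key
JOIN   store_dimension   s ON s.store_key   = f.store_key
JOIN   product_dimension p ON p.product_key = f.product_key
GROUP  BY d.fiscal_year, s.region, p.product_group
ORDER  BY d.fiscal_year, s.region, p.product_group;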

2. A TYPICAL DATA WAREHOUSING ENVIRONMENT


We study most of the database / data warehousing operations in terms of views at different levels. Though the hardware/software views are exhaustive, most users would like to be shielded from such details and would be happy to deal with the system at the "user's level", i.e. they hand over queries at the "analytical level", the system operates on them at the "operational level" and hands over the required results again at the "analytical level". The analytical level can be considered to be the level of logical relationships, while the operational level is the level corresponding to computer operations.

Look at the following schema:

[Figure: Different views of a data warehouse - data from Source A, Source B, Source C and Source D at the operational level is extracted and transformed into the user's view.]

The different sources may be available in different storage devices. They are extracted, put in a common place (only the relevant portions) and are transformed, so that


they contain the required results. These are then presented to the user in response to his

queries.

3. DATA MODELING - STAR SCHEMA FOR MULTIDIMENSIONAL VIEW

In a data warehousing environment, since the data warehouse comprises a central repository of data collected from different sources, maybe even at different periods of time, the problem of presenting an integrated view to the user assumes prime importance. Obviously such divergent information cannot be integrated and used unless it is modelled as independent entities. Further, such modelling should keep the end user's perspective in mind. The better the understanding of the user's perspective, the more effective and efficient will be the data warehouse operations. The warehouse designer should have a thorough knowledge of the various requirements and generalities in order to effectively capture the data model for the warehouse.

One obvious way of doing this is to view the divergent sources of data entities that go into the warehouse as individual tables. Such data sets can be thought of as forming a "star schema" - i.e. the individual tables beam into the warehouse, where they will be denormalised and integrated, so as to be fit to be presented to the end user.

One typical warehouse example:

[Figure: Student results (Sub 1, Sub 2, Sub 3) from School 1 and School 2 feed into the regional result held in the warehouse.]


While the data warehouse in the above example contains the results of the region, the viewer may like to view it school wise, subject wise or student wise. Each of these views adds a 'dimension' to the view. Often, in large warehouses, the number of possible dimensions will be too large to be comprehended at one go.

But how does the star schema actually work? From the Database Administrator's angle, it is a relational schema (a table, in simple terms). A simple star schema has a central "fact table" which contains raw, numeric facts. These facts are additive and are accessed through the various dimensions. Since the fact tables contain the entire data of the warehouse, they will normally be huge.

[Figure: Star schema for the above cited example - a central fact table (marks of the region) surrounded by dimensions such as School and Subject.]
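A minimal SQL sketch of a star schema for the student-results example above; the table and column names are invented for this illustration, and a real design would carry many more attributes.

-- Dimension tables: small and descriptive.
CREATE TABLE school_dimension (
    school_key  INTEGER PRIMARY KEY,
    school_name VARCHAR(100),
    district    VARCHAR(100)
);

CREATE TABLE subject_dimension (
    subject_key  INTEGER PRIMARY KEY,
    subject_name VARCHAR(100)
);

CREATE TABLE student_dimension (
    student_key  INTEGER PRIMARY KEY,
    student_name VARCHAR(100),
    city         VARCHAR(100)
);

-- Fact table: raw, additive, numeric facts, reached through the dimensions.
-- One row per student, school, subject and examination year.
CREATE TABLE result_fact (
    student_key INTEGER REFERENCES student_dimension,
    school_key  INTEGER REFERENCES school_dimension,
    subject_key INTEGER REFERENCES subject_dimension,
    exam_year   INTEGER,
    marks       INTEGER
);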

4. THE VARIOUS DATA MODELS

Now, anybody with a little imagination could visualise that the simple star structure depicted above suffices only for reasonably simple and straightforward databases. A practical data warehouse will be much more complex, in terms of the diversity of the subjects covered, their interrelationships and also the various perspectives. In such a scenario, adding more dimensions, thus increasing the scope of the attributes of the star schema, can solve the problem only up to an extent. But sooner rather than later, this structure collapses. To avoid such breakdowns, a better technique called the multifact star schema or the snowflake schema is used. The main problem of the simple star schema, its inability to grow beyond reasonable dimensions, is overcome by providing aggregations at different levels of hierarchies in a given



dimension, i.e. a given dimension is now not just a collection of entities, but a collection of hierarchies, each hierarchy being a collection of entities. This goal is achieved by normalising the respective hierarchical dimensions into more detailed data sets, to facilitate the aggregation of fact data. The data warehouse itself may be a collection of different groups, each group addressing a specific aspect of performance, thus catering to the needs of specific users or user groups. Each group of fact data can be modelled using a separate star schema.

In essence, we are not abandoning the star schema concept, but are building on it. Simply put, we are dividing the complex schema into smaller schemas (and these into still smaller schemas, if necessary) and combining these back to get the completed data model. Hence the name "multifact star schema" or "snowflake schema".
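Continuing the hypothetical student-results schema sketched earlier, the SQL below normalises the school dimension into a district/region hierarchy, which is the essence of the snowflake (multifact star) approach; the names are again invented.

-- The school dimension becomes a hierarchy of smaller dimension tables.
CREATE TABLE region_dimension (
    region_key  INTEGER PRIMARY KEY,
    region_name VARCHAR(100)
);

CREATE TABLE district_dimension (
    district_key  INTEGER PRIMARY KEY,
    district_name VARCHAR(100),
    region_key    INTEGER REFERENCES region_dimension
);

CREATE TABLE school_dimension_sf (
    school_key   INTEGER PRIMARY KEY,
    school_name  VARCHAR(100),
    district_key INTEGER REFERENCES district_dimension
);

-- An aggregation at the district level of the hierarchy.
SELECT r.region_name,
       d.district_name,
       SUM(f.marks) AS total_marks
FROM   result_fact f
JOIN   school_dimension_sf s ON s.school_key   = f.school_key
JOIN   district_dimension  d ON d.district_key = s.district_key
JOIN   region_dimension    r ON r.region_key   = d.region_key
GROUP  BY r.region_name, d.district_name;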

5. THE VARIOUS ONLINE ANALYTICAL PROCESSING (OLAP) TOOLS

The online analytical processing tools, which form the backbone of any data warehouse system, can be broadly classified into 2 categories: the multidimensional (MOLAP) and the relational (ROLAP) tools. As the names suggest, the multidimensional tools look at the database as a multidimensional entity. The addition of each new set of entities increases the dimensions of the schema. It also implies that such tools can work only in the presence of a multidimensional database (MDDB). A simple example is the case of a student database, wherein each facet of the student, like his academic performance, his extracurricular activities, his financial commitment vis-a-vis the college etc., adds a new dimension to his database. Now, if somebody is interested in his academics only, he will search only the academic dimension, while the college office is concerned with the financial dimension only, and so on.

A Relational OLAP, in contrast, looks at the database essentially as a relation - i.e. a table. However complex the database may be, it has to be converted to a tabular form. Then the standard relational operations can be made use of. Those familiar with the operations of relational databases would recollect that this method, though simple,


involves a lot of redundancy and also makes the processing of the data cumbersome. However, a relational OLAP has the advantages of simplicity and uniformity.

There are several hybrid approaches also, i.e. they integrate the two methods at various levels - for example by having a table of hierarchies - and these are usually multirelational systems. However, all these tools basically implement the "star schema". As a thumb rule, one can say that the Relational OLAP is used where the complexity of the application is at the lower end and the performance expectations are limited, i.e. in situations wherein the efforts available for system development are limited. However, as the complexity increases, the relational models generally become both unwieldy and less efficient performance-wise, and the choice is then normally the multidimensional OLAP.

With this introduction, we now look into the typical multidimensional and relational architectures.

6.  RELATIONAL OLAP

The main strength of this architecture is its simplicity and universality.

[Figure: Relational OLAP architecture - a front-end tool sends an info request to the ROLAP server, which performs metadata request processing and issues SQL to the database server; result sets flow back through the ROLAP server to the front-end tool.]

It can support any number of layers of data, and the main advantage is that new data can be added as additional layers without affecting


the existing data, i.e. the database can be thought of as a collection of two-dimensional relational tables that can be used to produce multidimensional views. The other advantage is the availability of several strong SQL engines to support the complexity of multidimensional analysis. Relational databases have grown over several years, and hence the entire expertise available can be made use of to provide powerful search engines to support the complexity of multidimensional analysis. These include creating multiple SQL statements to handle complex user requests, optimizing these statements using standard RDBMS techniques and searching the database from multiple points. But what makes the relational OLAP most attractive, possibly, is its flexibility and also the availability of products that can work efficiently on un-normalised database designs.

However, the Relational OLAP comes with the limitations of its standard operations (in spite of the optimizations at the DBMS level described above). Thus, recently the Relational OLAP tools have been shifting towards the concepts of middleware technology to simplify visualization, design and applications. Also, instead of pure relational OLAP, hybrid systems, which make use of the relational operations only to the extent to which they remain convenient and beyond that level make use of other methods, are coming into existence.
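As a hypothetical illustration of how a ROLAP engine turns a multidimensional request into SQL over the relational star schema sketched earlier, the query below uses the standard ROLLUP grouping (supported by the major relational engines) to produce subtotals at every level.

-- Total marks by school and subject, with subtotals per school and a grand total.
SELECT s.school_name,
       sub.subject_name,
       SUM(f.marks) AS total_marks
FROM   result_fact f
JOIN   school_dimension  s   ON s.school_key    = f.school_key
JOIN   subject_dimension sub ON sub.subject_key = f.subject_key
GROUP  BY ROLLUP (s.school_name, sub.subject_name)
ORDER  BY s.school_name, sub.subject_name;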

The Multidimensional OLAP:

[Figure: Multidimensional OLAP architecture - a front-end tool sends an info request to the MOLAP server, which performs metadata request processing; data is loaded from the database server via SQL, and result sets are returned to the front-end tool.]

This is the other style of design available to a data warehouse designer. Here, the data is basically organized in an aggregate form, i.e. instead of the simple 2-dimensional


relational model, each data object interacts with others in a variety of ways, and obviously capturing these multidimensional interactions would need a multidimensional database operation with tight coupling between the applications. Efficient implementations would store the data in the form in which it is utilized most of the time during the search process, i.e. at the design stage itself, the programmer should not only visualize the various interactions between the data elements, but also have an idea about what the end user will be expecting the search engines to search for - at least most of the time. He has to work under the dual constraints of visualizing the multiple - often invisible at first sight - relationships on the one hand and capturing such relationships to ensure the most optimal pattern of storage on the other. Most commercial products in this category come with the concept of "time" in their operation, i.e. it is not sufficient if the data is made available; it should be made available within specified times.

This discussion brings out one major limitation of such systems - namely, maintainability. Any database worth its name will not be static. New data are added, old ones are deleted and, what is more, certain relationships may get modified. Incorporating such frequent changes into a multidimensional OLAP can be tricky. Several suppliers provide standard tools which, to some extent, take care of such modifications - but they can be useful only if the changes are not too drastic.

Thus, the multidimensional OLAP is best utilized for applications that require iterative operations and comprehensive time series analysis of trends.
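A true MOLAP server stores its aggregates in multidimensional arrays rather than in tables, but the idea of pre-aggregation can be sketched in plain SQL (again using the hypothetical student-results schema); the standard CUBE grouping computes every combination of the chosen dimensions in advance, so that later requests become simple lookups.

-- Pre-aggregated "cube" over school, subject and examination year.
CREATE TABLE result_cube AS
SELECT s.school_name,
       sub.subject_name,
       f.exam_year,
       SUM(f.marks) AS total_marks,
       COUNT(*)     AS result_count
FROM   result_fact f
JOIN   school_dimension  s   ON s.school_key    = f.school_key
JOIN   subject_dimension sub ON sub.subject_key = f.subject_key
GROUP  BY CUBE (s.school_name, sub.subject_name, f.exam_year);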

7. MANAGED QUERY ENVIRONMENT (MQE)

The latest OLAP tools provide the users with the capability of performing analysis directly against the database, albeit in a limited sense. They can bring in a limited multidimensional OLAP server to simplify the query processing environment - hence the name managed query environment (MQE). Though the actual implementations differ, the concept can be highlighted in the following manner:


Instead of the user going to the database every time, he brings an ad hoc "data cube" to his local system. (The idea can be imagined as follows - there is a huge "cube" of data available in the database, out of which you will frequently be making use of only some portions; you make a copy of a "section" of that data on your machine.) This can be done by first developing a query to select the data from the DBMS, having it delivered to your desktop, after which it can be manipulated (accessed, updated, modified) so as to reduce the overhead required to create the structure each time the query is executed. In another approach, these tools can work with multidimensional OLAP servers, so that the data from the RDBMS first goes to the multidimensional OLAP server and then on to the desktop.

This approach provides for ease of operation and administration, especially when the end user is reasonably familiar with RDBMS operations. It is cost effective and efficient at the same time.

However, certain shortcomings persist when the data cubes are built and maintained on separate desktops. The factors of data redundancy and data integrity need to be addressed more effectively. Also, in multiuser systems, if each user chooses to maintain his own data cubes, the system will come under a lot of strain and data consistency may take a beating. Thus, this method can be effective only when the data volumes are small.
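A hypothetical sketch of the MQE idea in SQL: the aggregated "section" of warehouse data is materialised once, and later analysis runs only against the local copy (in practice the desktop tool keeps the cube in its own storage format rather than in a database table).

-- Build the local data cube once, from the warehouse.
CREATE TABLE local_sales_cube AS
SELECT d.fiscal_year,
       d.fiscal_month,
       p.product_group,
       SUM(f.amount) AS total_sales
FROM   sales_fact f
JOIN   date_dimension    d ON d.date_key    = f.date_key
JOIN   product_dimension p ON p.product_key = f.product_key
GROUP  BY d.fiscal_year, d.fiscal_month, p.product_group;

-- Subsequent analysis manipulates the local cube only.
SELECT fiscal_year,
       SUM(total_sales) AS yearly_sales
FROM   local_sales_cube
GROUP  BY fiscal_year;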

8. DATA WAREHOUSING PRODUCTS - STATE OF THE ART

Data mining and data warehousing being a relatively new and fast growing field, any commitment on the state of the art of the technology is hazardous. Further, because of the intuitive approach to the problem taken by corporates, newer and better tools keep flooding the market. What we discuss in the next few pages can be taken as a guide to the types of products available - it need not be taken as an exhaustive list of the options available.


Basically, all the data warehousing tools available in the market allow the user to aggregate the data along common dimensions, i.e. the user can choose to group the data in various forms so that, at a later time, he can navigate along these dimensions with the click of a button. However, while the tools provide the facility to integrate data, the actual choice of integration methods is still with the user, and as such an insight into the type and suitability of the data is essential for effective use of these tools.

Roughly, these tools work in two different ways. One set of tools, like Oracle's Express, preaggregate the data into multidimensional databases. Other tools work directly against relational data. While the merits of each of these approaches are debatable, leading database vendors like Oracle and Microsoft have taken steps to incorporate some of the features of OLAP into their normal database software. So a time may come when data warehousing tools will not be separate tools, but will form a part of the database software.

Now, to the features of some of the tools available. (You may not be able to work on them, but the discussion tells you about the fascinating number of options provided by them.)

IBI Focus Fusion

Focus Fusion from Information Builders Inc. uses multidimensional database technology, with business application analysis packages. It combines a high performance, parallel-capable engine, supplemented with administrative, copy management and access tools. Some of the features are:

1.  Fast query & reporting: Advanced indexing, parallel query and rollup facilities provide high performance. Indexing ensures appropriate arrangement of the data to allow direct accessing capabilities. Parallel query facilities ensure that several sections of the query are searched independently of one another, so that the overall search time is minimized. The software also


ensures scalability, so that it is possible to use the solutions for data

warehouses of varying sizes, or those built in stages. Scalability would be a

critical factor when dealing with warehousing of data of large corporates.

2.  It has a comprehensive GUI (Graphical User Interface) to facilitate ease of administration.

3.  A complete portfolio of business intelligence applications that span a wide range of reporting, query, decision support and EIS needs, with fully integrated models.

4.  Integrated copy management facilities, which schedule automatic data refresh from any source into Fusion. This ensures data scalability and integrity.

5.  Fusion works with a wide variety of desktop tools, including World Wide Web browsers. This is ensured with open access via industry-standard protocols like SQL, ODBC and HTTP.

6.  A three-tiered reporting architecture ensures high performance.

7.  Provides for several classes of precalculated summaries. Also provides for data manipulation capabilities that allow for every conceivable type of manipulation.

8.  For any large scale database operation, it is desirable that the data is partitioned into several classes. Fusion provides for fully transparent partitioning without disrupting the users.

9.  Support for a parallel computing environment.

10. Seamless integration with more than 60 different databases on various platforms.

Cognos PowerPlay

This is an OLAP tool that can interface and operate with a wide variety of software tools, databases and applications. The highlight of the solution is that it stores data in multidimensional data sets called "power cubes". These power cubes are stored on either a Cognos universal client or on a server. They can also be on the LAN or inside


popular relational databases. PowerPlay features fast installation and deployment capabilities, scalability and economical costs.

1.  Supports data cubes of more than 20 million records.

2.  Supports "scatter charts" that let the users show data across two measures, so that comparisons can be made (e.g. the budgeted values and actual figures can be shown side by side for comparison).

3.  Supports linked displays, i.e. multiple views of the same data in a report.

4.  Has a large number of formatting features for financial reports, like single and double underlining, brackets for negative numbers etc.

5.  Unlimited levels of undo operations and customizable tool bars.

6.  Word processing, spreadsheet and presentation software features.

7.  32-bit implementation, so it can work better with the latest operating systems.

8.  Can create power cubes from the existing databases supported by packages like Oracle, SYBASE etc.

9.  Schedules power cube creation for off-peak processing times.

10. Advanced security features to lock by dimension or category, either on the client or the server or both.

11. Users can pull subsets of information from the server to process on the client.

12. Database management and database security features are integrated.

13. Data cubes can be created by drawing data from different data sources.

14. Multidimensional data cubes can be created and processed.

In principle, PowerPlay manages query analysis as a process that runs on a population of data cubes.

Pilot Software

It is a package of several PILOT decision support tools that form a high speed multidimensional database. Some of the software components that form the core of the offering are:


1.  PILOT Analysis Server: A multidimensional database with a GUI. It includes the latest version of the expert level interface and a multidimensional relational data store.

2.  PILOT Link: A database connectivity tool that provides ODBC connectivity via specialized drivers to most relational databases. It also comes with a GUI.

3.  PILOT Designer: To develop applications speedily.

4.  PILOT Desktop: Used for navigation and search across several multidimensional databases.

5.  PILOT Sales and Marketing Analysis Library: Provides applications that allow sophisticated sales and marketing models to be visualized. It also allows the user to modify the tools to satisfy specific deviations.

6.  PILOT Internet Publisher: Allows users to access PILOT databases via browsers on the internet.

The main advantage of having such differentiated tools is that it is easy to modify &

customize the applications.

The other features that are common to the PILOT software are:

1.  Many of them provide time as one of the dimensions, so that periodic reports, updations and shifting from one time base to another become straightforward.

2.  Provide integrated, predictive data mining in a multidimensional environment.

3.  Provide for compression of sparse cells (those cells which have no value, but still form part of the matrix), compression of old (time-based) cells, implicit declaration of some dimensions (they need not be explicitly specified in the query, but are automatically calculated, as long as they are defined as attributes of certain other dimensions), creation of dynamic variables etc. All these features decrease the total size of the database and hence reduce the time for navigation, without actually losing data.

4.  Allow for seamless integration with existing OLTP systems. The users can also specify the views of the database that they frequently refer to, and the system self-optimizes the relevant queries.


The above is neither an exhaustive list of tools, nor are the features completely listed. They only indicate the type of support one can expect from such tools, and should be useful in deciding on one tool over another for actual implementation.

Summary

In this block, you were introduced to the differences between the OLTP (database) and OLAP (warehouse) concepts. Some of the concepts underlying a typical data warehouse were discussed in brief. You also learnt about star schema modelling and about three commonly used tools - IBI Focus Fusion, Cognos PowerPlay and Pilot Software.

Review Questions

1.  OLAP stands for _________________

2.  OLTP stands for _________________

3.  In an OLAP system, the volume of transactions is _________________

4.  A _________________ manages both current and historic transactions.

5.  A star schema is organised around a central table called _________________ table.

6.   _________________ are locally situated multidimensional data sets, which form

subsets of the data warehouse.

7.   _________________ is the ability of the application to grow over a period of time

8.   _________________ software comes with special, business oriented applications.

9.  Power cube creations are normally scheduled for _________________ periods to

reduce the load on the system

10. DSS stands for _________________

Answers:

1.  On Line Analytical Processing

2.  On Line Transaction Processing

3.  Low

4.  OLAP

5.  Fact Table


6.  Power cubes

7.  Scalability

8.  PILOT

9.  Offpeak

10. Decision Support Systems


BLOCK - III

THE PROCESS OF A DATA WAREHOUSE DEVELOPMENT

In this block, you will be introduced to the step by step methodology of

developing a data warehouse. Beginning from the choice of the subject matter, a brief

introduction to the various stages of development, tradeoffs involved and pitfalls in each

are discussed in brief. You are advised to through in material in detail and ensure that

you understand various terminology involved. It is needless to say, how ever, that the

development of a data warehouse is both an art and a science. While the science portion

can be taught, the art portion is to be developed by practice.

Contents:

1.  When do we go for a data warehouse?

2.  The basic strategy for a data warehouse

3.  Design of a warehouse

4.  Data content

5.  Metadata

6.  The actual development process

7.  The process of a data warehouse design

8.  Considerations of Technology

i.  Hardware platforms

ii.  The DBMS

iii.   Networking capabilities

9.  Role of access tools

10. A data warehouse implementation algorithm

11. Summary


THE PROCESS OF A DATA WAREHOUSE DEVELOPMENT

1. WHEN DO WE GO FOR A DATA WAREHOUSE?

As we have seen earlier, a data warehouse is built usually to get answers to

strategic questions regarding policities & strategies based on past (historical) data. From

the business perspective, it is a tool for the quest for survival in the competitive

environment. What decisions previously used to take weeks & months to arrive at, are to

 be taken with the hours, if not minutes. Added to the demands on speed, is the increase

in volume of data available to be processed. Since the available data in most business

areas are predicted to double every two years, the need for efficient & reliable data

warehousing cannot be over emphasized.

Add to this the changes that keep taking place. Entire business models keep

getting modified, if not totally being discarded and we get a reasonable perspective for

efficient data warehousing.

Hence, the need to organize, maintain large amounts of data, so that they can be

analyzed within minutes in the manner and depth desired becomes important. Thus, one

cannot fail to identify the need for efficient data warehousing strategies.

Before we start looking into the actual design aspects of a data warehouse, we should also see why the conventional information systems could not meet these requirements. The conventional DBMS systems originated basically for homogeneous and platform dependent applications. Also, they were designed for data that changes slowly, and for situations where the acceptable search times were reasonably high. But with the advent of very fast CPUs and larger and cheaper disk space, the ability and the need to work on very large databases which are dynamic was felt. (The concept of networking, with ever increasing bandwidths, made the available data, as well as the


results, highly dynamic.) Thus, the need for an alternative - online analytical processing, as opposed to online transaction processing - was felt. Hence the OLAP systems.

Having once again assured ourselves of the basic features involved in data warehouses, in the following sections we survey the issues involved in building a warehouse - beginning with the design approaches, architectures, design tradeoffs, the concept of metadata, data rearrangement and tools, and finally the various performance considerations.

2. THE BASIC STRATEGY FOR A DATA WAREHOUSE

Just like any other software, a data warehouse can be built using either a top-down or a bottom-up approach, i.e. one can begin with the overall structure required and break it into modules, submodules etc. That is, we can begin at the level of a global data warehouse for the entire organisation, split it into individual warehouses (data marts) for the departments, and break it further based on products/locations etc., until we arrive at modules which become small enough to be handled independently. Each of these can be built by one or more project groups (often in parallel) and can be integrated to suit the original needs.

Alternatively, begin at the lower end and combine the sub-data marts into data marts, and the data marts into the data warehouse, to get all possible analyses that you can get from the warehouse.

However, the discussion is not just about systems and programming. One will also have to look into the location of the various departments, the levels of interaction between them, the paths of data flow, the sources of data and the demand centres of analysed information etc., and arrive at a suitable model. Often, a suitable combination of top-down and bottom-up designs (or further combinations thereof) is used.

3. THE DESIGN OF A WAREHOUSE  

As you know, the very first stage in any software project is the design. In the case of a data warehouse, the problem is a little more complex because of the volume of


data and the dynamic nature thereof. However, the very first step can definitely be to take a holistic approach to the proposed data warehouse - identify all possible sources of data (present and future), their possible utility for the various departments, the possible paths of data travel etc., and arrive at a comprehensive, single system that effectively captures all possible user requirements. Any failure at this stage will result in a skewed (unbalanced) data warehouse that caters to only a few requirements, shutting out others. This may, in the long run, undermine the utility of the warehouse itself. The main difficulty arises in identifying future trends and making room for them. Further, to enhance data accessibility, especially in organisations that are geographically spread out, web enablement would be highly desirable.

However, there are three major issues in the development of a data warehouse that need very careful consideration.

1. The available data will, more often than not, be heterogeneous in nature, i.e. since the data come from various, unconnected sources, they need to be converted to some standard format, with reference to a uniformly recognised base. This requires a fair amount of effort and ingenuity. Also, the data needs to be maintained, i.e. with the passage of time, the data becomes obsolete and requires updation. Again, because the various pieces of data are from different sources, a substantial amount of effort is required to upgrade them uniformly to maintain data integrity. Since important decisions are taken based on the data values, their reliability and authenticity should be beyond doubt at all times.

2. Unlike databases, in data warehouses historic data cannot be scrapped, but has to be arranged in a format that is both concise and precise on the one hand and cost effective on the other. This is a very fundamental challenge in any data warehouse operation, and needs to be addressed at the design level itself.

3. Mainly because of the above considerations, and also because of the constant inflow of new data, the warehouse tends to grow out of proportion very quickly.


Specific provisions are to be made to identify and weed out old data, subject to the constraints imposed by condition (2) above.

Thus, one can safely presume that the design of a warehouse is definitely more complex and tricky compared to a database design. Also, since it is business driven and business requirements keep changing, one can safely say it is not a one-time job, but a continuous process.

4. DATA CONTENT  

Compared to a database, a warehouse contains data which needs to be constantly monitored and modified if found obsolete. Also, the level of abstraction in a data warehouse is more detailed, partly to facilitate ease of analysis and partly to ensure ease of maintenance.

Thus, the data models used in a data warehouse are to be chosen based on the nature, content and processing pattern of the data warehouse. Before the data is actually stored, one will have to clearly identify the major components of the model and their relationships, including the entities, attributes, their values and the possible keys.

But the more difficult task is for the designer to identify the query process and the path traveled by a query. Because of the varying nature of queries, this is more easily said than done. Visualising all possible query combinations, their frequencies etc. before arriving at the most optimal storage patterns is the key to a successful design. In addition to optimising the data storage for high query performance, one should also keep in mind the data storage requirements and the data loading performance of the system.

Thus, no specific rules for the design can be prescribed, and a lot of fine tuning based on experience needs to be done. Further, since the data handled will normally be voluminous, a decision on its actual distribution - whether on a single server, on several servers on the network etc. - is to be taken. It can also be divided based on region, time


or subject. Of course, each of these needs to be optimised individually, as well as in combination.

5. METADATA

Since the data in a warehouse is voluminous in content and varied in terms of the models, the relationships between the databases, amongst themselves and with the warehouse as a whole, need to be made known to the end users and the end-user tools. The metadata defines the contents and the location of the data in the warehouse. This facilitates further updating and maintenance of the data warehouse. It is used by the users to find the subject areas and the definitions of data. It also helps the users to modify and update the data and the data models. It essentially acts as a logical link between the decision support system application and the data warehouse.

Thus, a data warehouse designer would also create a metadata repository which has access paths to all important parts of the data warehouse at all points of time. The metadata works like an access buffer between the tools and the data, and no user or tool can directly meddle with the data warehousing environment. The actual choice of the format for the metadata, of course, is left to the designer.

6. THE ACTUAL DEVELOPMENT PROCESS

As we have seen earlier, a number of tools are available for each phase of development. They provide facilities for defining the transformation and cleanup, data movement, query processing, reporting and analysis. They differ in capabilities and compatibilities, and it is left to the designer to choose appropriate tools and also modify his design modules to fit the capabilities of these tools.

No doubt, the metadata should be able to effectively address the databases and the tools that are used. Further, an injudicious choice of tools, or diluting the design specifications to accommodate the tools, may result in inefficient data warehouses which will soon become unmanageable.


Having seen the various stages of a data warehousing design, we will look at an

actual step by step procedure to design workable data warehouses.

7. THE PROCESS OF A DATA WAREHOUSE DESIGN:

The process of a data warehouse design is complex because of the vague nature of the goals involved. Quite often, all the guidance available to a data warehouse designer is "take all the enterprise data and build a data warehouse, so that the management can get answers to their questions".

In such a situation, all that the designer can do is to start somewhere and get going. The most common technique is to develop a data mart and gradually grow it into a full-fledged data warehouse.

Ralph Kimball identifies a nine-step strategy to build a data mart; two further steps can be added to grow the marts into a warehouse.

They are:

1.  Choose the subject matter (one subject at a time)
2.  Decide what the fact table represents
3.  Identify and conform the dimensions
4.  Choose the facts
5.  Store pre-calculations in the fact table
6.  Define the dimensions and tables
7.  Decide the duration of the database and the periodicity of updation
8.  Track the slowly changing dimensions
9.  Decide the query priorities and query models
10. Build a few simple data marts and
11. Integrate them in stages

Let us briefly look into the details of the above steps

1.  Often, even people who have worked with the organisation for several years will find it difficult to clearly identify the areas of activity and partition them. Hence, the warehouse designer, normally an outsider, would find it quite


difficult to decide on the various subject matters to deal with. Of course, he will interact with the users of the proposed warehouse at various levels and elicit their requirements through interviews and questionnaires, by going through the various documents or by simply watching the procedures. If there is already a level of computerisation, the DBAs would give invaluable information regarding the sources of data, their quality and validity.

Armed with this information, the designer will have to decide on his own how to partition the activities into subject matters and which of them should be implemented to begin with. Normally, the 'hot subjects' will be given priority, i.e., those which are likely to interest most people or those which are likely to immediately benefit the organisation.

2.  A fact table is a large central table in a dimensional design that has a multipart key. The parts of the key can be combined to form query keys for the data mart. For example, for a student database, the fact table may contain the student's various particulars and each of them can be a part of the key. Converting the facts into the fact table is a very crucial step and involves several brainstorming sessions.

3.  The dimension table design is the next important step, which converts the fact table into a multidimensional table. Each dimension normally refers to one set of related activities and would lead to a multidimensional or relational database, as the case may be. The dimensions are the source of new headers in the user's final reports. Since the choice of the dimensions freezes the data warehouse specifications to some extent, sufficient thought should be given to the future growth of the warehouse or of the organisation itself.

Duplicate or superfluous dimensions should be avoided without compromising the long range perspectives of the warehouse. However, if two data marts end up having the same dimensions, they should conform to each other. This would ensure ease of standardising the queries. (A small sketch illustrating steps 2 and 3 appears at the end of this discussion.)


The remaining steps are the logical followup of the first three stages.

4.  The choice of the facts, though it appears simple, can sometimes be tricky, especially if step 1 above is not carried out properly. All facts that pertain to the dimensions should be correctly identified and their links to other data items ascertained.

5.  The relations between the various entities are expressed in terms of pre-

calculat ions and are stored in the fact tables.

6.  This stage involves the choice of the number, content and dimensions of the various tables used in the operation. While the selection may appear simple, one has to note that choosing too few tables would make each of them too voluminous and hence query processing becomes inefficient. On the other hand, too many small tables would create problems of storage, consistency and data integration.

7.  The duration of the database and the periodicity of updation are decided mainly by the type of operations of the organisation, the frequency of data sampling and, to some extent, the time and space constraints of the software. As already indicated, any updation means the previous data is stored as historic data, in a suitable format, depending on its importance.

Steps 8 and 9 require several iterations, spread over a period of time, and would possibly involve accommodating conflicting priorities.

Steps 10 and 11 are self explanatory.
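To make steps 2 and 3 concrete, the following minimal Python sketch shows a fact table with a multipart key and two dimension tables for the student example mentioned above. All names and values are purely hypothetical and serve only as an illustration.

```python
# Illustrative sketch of steps 2 and 3: a fact table with a multipart key and
# two conformed dimension tables. All names and values are hypothetical.

# Dimension tables: one row per member, keyed by a surrogate key.
student_dim = {
    101: {"name": "Asha", "department": "CS&E", "year": 2},
    102: {"name": "Ravi", "department": "E&C",  "year": 3},
}
course_dim = {
    "DW01": {"title": "Data Warehousing", "credits": 4},
    "DM02": {"title": "Data Mining",      "credits": 3},
}

# Fact table: the multipart key (student_id, course_id, term) identifies each row;
# the remaining columns hold the numeric facts (here, marks and attendance).
fact_table = {
    (101, "DW01", "2005-odd"): {"marks": 78, "attendance": 0.92},
    (101, "DM02", "2005-odd"): {"marks": 84, "attendance": 0.88},
    (102, "DW01", "2005-odd"): {"marks": 66, "attendance": 0.75},
}

# A query combines parts of the key with dimension attributes, e.g.
# "average marks per department for course DW01".
totals = {}
for (student_id, course_id, term), facts in fact_table.items():
    if course_id == "DW01":
        dept = student_dim[student_id]["department"]
        totals.setdefault(dept, []).append(facts["marks"])
for dept, marks in totals.items():
    print(dept, sum(marks) / len(marks))
```

Because both marts in this sketch would share the same student and course dimensions, the dimensions are conformed and queries phrased against one mart carry over to the other.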

8. CONSIDERATIONS OF TECHNOLOGY:


While the above discussion covers the implementation issues, several technological issues also need to be addressed. Some of them are:

i)  The hardware platforms

ii)  The DBMS

iii)    Networking infrastructure

iv)  Operating systems and system management platforms

v)  Software tools

i) Hardware platforms: While implementing a data warehouse, the existing hardware can be utilised, provided the disk storage space is sufficient: usually of the order of gigabytes. Apart from the record size, sufficient space for processing, indexing, swapping etc. needs to be made available, apart from, of course, the space required for the system software. Further, because of the sensitivity of the data, sufficient scope for backup is to be built in to safeguard against crashes. Though any reasonably fast processor can be used, the trend is to go for a dedicated data warehouse server. Such servers, apart from being able to support large data volumes and fast operations, are scalable – a very important characteristic for a warehouse, as a practical data warehouse keeps growing throughout its life cycle. In fact, as the data volume increases, the capabilities need to increase more than proportionately, to take into account the more complex indexing and computational aspects. Further, if the querying is to go over a public data network (like the internet), a multiprocessor configuration with sufficient I/O bandwidth is essential and a balance between the I/O and computational capabilities of the server is to be achieved. If this is not done, the I/O processing could end up as a bottleneck. This is achieved by choosing different types of processors (not just a multiprocessor system, but a multi-type processor system) and also by having disk controllers (sometimes more than one) to control the required number of disks.

But it is needless to say that for maximum efficiency, each major component of

the system should be selected such that optimum performance and scalability is achieved.


Otherwise, one or the other component will end up blocking further innovations to the warehouse.

ii) Choice of the DBMS: This is as important as the hardware selection, if not more so, as it determines the performance of the warehouse to no lesser extent. Again, the parameters remain the same – scalability, the ability to efficiently handle large volumes of data and speed of processing.

Almost all the well known DBMSs – Oracle, Sybase, DB2 – support parallel database processing. Some of them also provide special features for operating on data cubes (described in the previous chapter).

iii) Networking capabilities: Most data warehousing applications work on an intranet (within the organisation) and a few may also work in the internet environment (web enabled). The choice to put the warehouse on a network is itself decided by various factors like security and privacy on the one hand, counterbalanced by accessibility and spread on the other. While not much extra networking hardware may be needed (apart from what is normally used) for warehousing, the software considerations and the planning process tend to become definitely more complex.

9. ROLE OF ACCESS TOOLS  

Though readymade data warehouses to suit every need are hard to get, several tools are available to ease the implementation of the warehouse. However, care is to be exercised to choose the best suitable tools (note the phrase best "suitable" tools, not the best tool, for no such best tool exists). To compare and understand their capabilities, a few of the following reports are generated on a trial basis.

1.  Statistical analysis

2.  Data visualisation, production of graphical reports

3.  General statistical analysis

4.  Complex textual search (text mining)


5.  Generation of user specific reports

6.  Complex queries which span multiple tables, involve multilevel subqueries and sophisticated computations.

10. A DATA WAREHOUSE IMPLEMENTATION ALGORITHM

Step 1: Define the data sources
Step 2: Create a data model; decide on the appropriate hardware and software platforms
Step 3: Choose the DBMS and other tools
Step 4: Extract the data from the sources and load it into the model
Step 5: Create the database connectivity software, using the various tools chosen in steps 2 and 3
Step 6: Define / choose suitable GUI (presentation) software
Step 7: Devise ways of updating the data, by channelising the data from the data sources periodically.
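To illustrate steps 4 and 7 of the algorithm, the following minimal sketch (in Python, using only the standard library) extracts records from a hypothetical CSV source, applies a simple cleaning step and loads them into a warehouse table. The file, table and column names are assumptions made purely for illustration, not a prescribed design.

```python
# A minimal sketch of steps 4 and 7: extract from a source, clean, and load.
# The file "sales_source.csv" and the table/column names are illustrative only.
import csv
import sqlite3

def load_sales(source_csv: str, warehouse_db: str) -> None:
    con = sqlite3.connect(warehouse_db)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                       sale_date TEXT, region TEXT, product TEXT, amount REAL)""")
    with open(source_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Cleaning step: skip rows with missing or non-numeric amounts.
            try:
                amount = float(row["amount"])
            except (KeyError, ValueError):
                continue
            con.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                        (row["date"], row["region"].strip().upper(),
                         row["product"].strip(), amount))
    con.commit()
    con.close()

# Step 7: the same routine can be scheduled to run periodically so that fresh
# data from the sources keeps flowing into the warehouse.
if __name__ == "__main__":
    load_sales("sales_source.csv", "warehouse.db")
```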

SUMMARY

You have been briefly introduced to the various stages of data warehouse development, with an algorithm emerging out of the discussions. The stages, namely collection of requirements, creating a data model, identifying data sources and data users, choice of hardware and software platforms, choice of reporting tools, connectivity tools and GUI, and periodic refreshing of the data, form the core of any data warehouse development.

The next unit, on data mining, and the case study that follows it are to be studied bearing in mind these fundamentals.


Review Questions

1.  The process of removing the deficiencies and loopholes in the data is called

 ____________ of data.

2.  The design of the method of information storage in the data warehouse is defined by

the ___________.

3.   ___________ provides pointers to the data of the data warehouse.

4.  A reasonable prediction of the type of queries that are likely to arise helps in improving the ___________ of search

5.  A balance between the ___________ processors and ___________ processors is

necessary for better performance of the data warehouse.

6.  Name any two methods of identifying the business requirements: ____________ and ______________.

7.  GUI stands for _________________

8.  The two basic design strategies of OLTP are ___________ and _____________.

Answers:

1.  Cleaning up

2.  Data model

3.  Metadata

4.  Efficiency

5.  Input/output, computational

6.  Interviews and questionnaires.

7.  Graphical User Interface.

8.  Top Down and Bottom up

Reference Books:

1. CSR Prabhu, ' Data Warehousing: Concepts, Techniques, Products and Applications',

PHI, New Delhi - 2001.


UNIT II

DATA MINING

COURSE INTRODUCTION

We know that lots of data is being collected and warehoused, and that data is collected and stored at enormous speeds. Data mining is a technique for the semi-automatic discovery of patterns, associations, changes, anomalies and rules in data. Data mining is interdisciplinary in nature. In this course you will study the importance of data mining, the techniques used for data mining, web data mining and knowledge discovery in databases.


BLOCK - 1

DATA MINING

Data Mining - An Introduction 

1.0 Introduction

1.1 What is data mining?

1.2 Few applications

1.3 Extraction Methods

1.4 Trends that Affect Data Mining

1.5 Summary

1.0 Introduction

The field of data mining is emerging as a new, fundamental area with important

applications to science, engineering, medicine, business and education. Data mining

attempts to formulate, analyze and implement basic induction processes that facilitate the

extraction of meaningful information and knowledge from unstructured data. Data

mining extracts patterns, changes, association and anomalies from large data sets. Work

in data mining ranges from theoretical work on the principles of learning and

mathematical representation of data to building advanced engineering systems that

 perform information filtering on the web. Data mining is also a promising computational

 paradigm that enhances traditional approaches to discovery and increases the

opportunities for breakthroughs in the understanding of complex physical and biological systems.


1.1 What is data mining

Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, rules and statistically significant structures and events in data, i.e., data mining attempts to extract knowledge from data.

Data mining is an interactive, semi-automated process that begins with raw data. The results of the data mining process may be insights, rules or predictive models.

The focus on large data sets is not just an engineering challenge; it is an essential feature of the induction of expressive representations from raw data. It is only by analyzing large data sets that we can produce accurate logical descriptions that can be translated automatically into powerful predictive mechanisms.

1.2 Few applications 

The opportunities today in data mining rest on a variety of applications. Many are interdisciplinary in nature.

a)  Neural Networks -  Neural networks are systems inspired by the human brain. A

 basic example is provided by a back propagation network which consists of input

nodes, output nodes and intermediate nodes called hidden nodes. Initially, the nodes

are connected with random weights. During the training, a gradient descent algorithm

is used to adjust the weights so that the output nodes correctly classify data presented

to the input nodes.

 b)  Tree - based classifiers - A tree is a convenient way to break a large data sets into

smaller ones. By presenting a learning set to the root and asking questions at each

interior node, the data at the leaves can often be analyzed very simply. Tree based


classifiers were independently invented in information theory, statistics, pattern

recognition and machine learning.

c)  Graphical Models and Hierarchical Probabilistic representation  - A directed

graph is a good means of organizing information about qualitative knowledge about

conditional independence and causality gleaned from domain experts. Graphical models were independently invented by researchers in computational probability and artificial intelligence studying uncertainty.

d)  Ensemble learning  - Rather than use data mining to build a single predictive model,

it is often better to build a collection or ensemble of models and to combine them, say

with a simple, efficient voting strategy. This simple idea has now been applied in a

wide variety of contexts and applications.

e)  Linear algebra  - Scaling data mining algorithms often depends critically upon

scaling underlying computations in linear algebra. Recent work in parallel algorithms

for solving linear systems and algorithms for solving sparse linear systems in high

dimensions are important for a variety of data mining applications, ranging from text

mining to detecting network intrusions.

f)  Large scale optimization -  some data mining algorithms can be expressed as large

scale, often non-convex, optimization problems.

g)  Databases, Data Warehouses and Digital Libraries -  The most time consuming

 part of the data mining process is preparing data for data mining. This step can be streamlined if the data is already in a database, data warehouse or digital library, although mining data across different databases remains a challenge.

h)  Visualization of Massive data sets : Massive data sets, often generated by complex

simulation programs, require graphical visualization methods for best comprehension.


i)  Multi-media documents :  Few people are satisfied with today's technology for

retrieving documents on the web, yet the numbers of documents and the number of

 people accessing these documents is growing explosively. In addition, it is becoming

easier and easier to archive multi-media data, including audio, images and video data,

 but harder and harder to extract meaningful information from the archives as the

volume grows.

 j)  Electronic commerce - Not only does electronic commerce produce large data sets in

which the analysis of marketing patterns and risk patterns is critical, but unlike some

of the applications above, it is also important to do this in real or near - real time, in

order to meet the demands of on-line transactions.

1.3 Extraction Methods

Information extraction is an important part of any knowledge management

system. Working in conjunction with information retrieval and organization tools,

machine driven extraction is a powerful means of finding contents on the web.

The precision and efficiency of information access improves when digital content

is organized into tables within a relational database. The two main methods of

information extraction technology are

-   Natural language processing

-  Wrapper induction

Information extraction identifies and extracts relevant information from texts, pulling information from a variety of sources and aggregating it to create a single view.


1.4 Trends that affect data mining

The following are a few trends which promise to have a fundamental impact on data mining.

Data trends :  Perhaps the most fundamental external trend is the explosion of digital

data during the past two decades. During this period, the amount of data probably has

grown between six to ten orders of magnitude. Much of this data is accessible via

networks.

Hardware trends :  Data mining requires numerically and statistically intensive

computations on large data sets. The increasing memory and processing speed of

workstations enables the mining, using current algorithms and techniques, of data sets that were too large to be mined just a few years ago. In addition, the commoditization of high

 performance computing through workstations and high performance workstation clusters

enables attacking data mining problems that were accessible using only the largest

supercomputers a few years ago.

Scientific computing trends :  Data mining and knowledge discovery serves an

important role linking the three modes of science, theory, experiment and simulation,

especially for those cases in which the experiment or simulation results in large data sets.

Business trends -  Today businesses must be more profitable, react quicker and offer

higher quality services than ever before and do it all using fewer people and at lower cost.

With these types of expectations and constraints, data mining becomes a fundamental

technology, enabling businesses to more accurately predict opportunities and risks

generated by their customers and their customer's transactions.

1.5 Summary

Data mining is the semi - automatic discovery of patterns, associations, changes,

anomalies, rules and statistically significant structures and events in data. Data mining


can be applied with Neural networks, tree based classifies , ensemble learning, linear

algebra, optimization, Databases and more.

1.6 Question /Answer Key

1.Data mining attempts to extract ______________ from data.

2. _____________ are systems inspired by the human brain is used for data

mining

3. The two main methods of information extraction are ____________ and

 ________________

Answers

1.  Knowledge

2.   Neural networks

3.   Natural language processing, wrapper induction

BLOCK - 2

DATA MINING FUNCTIONS

2.0  Introduction

2.1 Classification


2.2 Associations

2.3 Sequential patterns

2.4 Clustering/Segmentation

2.5 Summary

2.0 Introduction

In this unit you are going to study various data mining functions. Data mining

methods may be classified by the function they perform or according to the class of applications they can be used in. Data mining functions are helpful in solving real world problems.

2.1 Classification

Data mine tools have to infer a model from the database, and in the case of

supervised learning this requires the user to define one or more classes. The database

contains one or more attributes that denote the class of a tuple and these are known as

 predicted attributes whereas the remaining attributes are called predicting attributes. A

combination of values for the predicted attributes defines a class.

When learning classification rules the system has to find the rules that predict the

class from the predicting attributes so firstly the user has to define conditions for each

class, and the data mine system then constructs descriptions for the classes. Basically the system should, given a case or tuple with certain known attribute values, be able to predict what class this case belongs to.

Once classes are defined the system should infer rules that govern the

classification therefore the system should be able to find the description of each class.

The descriptions should only refer to the predicting attributes of the training set so that

the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.


A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where the LHS is true, the RHS is also true, or at least very probable. The categories of rules are:

exact rule - permits no exceptions, so each object of the LHS must be an element of the RHS
strong rule - allows some exceptions, but the exceptions have a given limit
probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability P(RHS)

Other types of rules are classification rules where LHS is a sufficient condition to

classify objects as belonging to the concept referred to in the RHS.
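As a rough illustration of these categories, the following sketch evaluates an "if LHS then RHS" rule against a handful of hypothetical tuples and reports its conditional probability P(RHS|LHS). The attribute names, the records and the exception limit are assumptions made purely for the example.

```python
# A minimal sketch of evaluating an "if LHS then RHS" rule over a training set.
# The attribute names and records are hypothetical.
records = [
    {"outlook": "sunny", "humidity": "high",   "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "rain",  "humidity": "high",   "play": "no"},
    {"outlook": "sunny", "humidity": "high",   "play": "no"},
]

def evaluate_rule(records, lhs, rhs, exception_limit=0.0):
    """lhs and rhs are dicts mapping attribute -> required value."""
    covered = [r for r in records if all(r[a] == v for a, v in lhs.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in rhs.items())]
    if not covered:
        return None
    p_rhs_given_lhs = len(correct) / len(covered)      # probabilistic rule: P(RHS|LHS)
    exceptions = 1.0 - p_rhs_given_lhs
    kind = ("exact" if exceptions == 0 else
            "strong" if exceptions <= exception_limit else "probabilistic")
    return kind, p_rhs_given_lhs

# "if outlook = sunny and humidity = high then play = no"
print(evaluate_rule(records, {"outlook": "sunny", "humidity": "high"},
                    {"play": "no"}, exception_limit=0.1))
```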

2.2 Associations

Given a collection of items and a set of records, each of which contain some

number of items from the given co llection, an association function is an operation against

this set of records which return affinities or patterns that exist among the collection of

items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can

involve any number of items on either side of the rule.

Another example of the use of associations is the analysis of the claim forms

submitted by patients to a medical insurance company. Every claim form contains a set of

medical procedures that were performed on a given patient during one visit. By defining

the set of items to be the collection of all medical procedures that can be performed on a

 patient and the records to correspond to each claim form, the application can find, using

the association function, relationships among medical procedures that are often

 performed together.
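A minimal sketch of the association function described above is given below; it computes the confidence factor of a rule over a small set of hypothetical records, each treated as a set of items.

```python
# A minimal sketch of the association function: given a set of records (e.g.
# market baskets or claim forms), compute the confidence factor of a rule
# {A, B, C} -> {D, E}. Items and records are hypothetical.
records = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"B", "C", "E"},
]

def confidence(records, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    covered = [r for r in records if lhs <= r]      # records containing the LHS
    matched = [r for r in covered if rhs <= r]      # ... that also contain the RHS
    return len(matched) / len(covered) if covered else 0.0

# Confidence of "records containing A, B and C also contain D and E"
print(confidence(records, {"A", "B", "C"}, {"D", "E"}))   # 1 of 3 covered -> 0.33
```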


2.3 Sequential/Temporal patterns

Sequential/temporal pattern functions analyze a collection of records over a

 period of time for example to identify trends. Where the identity of a customer who made

a purchase is known an analysis can be made of the collection of related records of the

same structure (i.e. consisting of a number of items drawn from a given collection of

items). The records are related by the identity of the customer who did the repeated

 purchases. Such a situation is typical of a direct mail application where for example a

catalogue merchant has the information, for each customer, of the sets of products that

the customer buys in every purchase order. A sequential pattern function will analyze

such collections of related records and will detect frequently occurring patterns of

 products bought over time. A sequential pattern operator could also be used to discover

for example the set of purchases that frequently precedes the purchase of a microwave

oven. Sequential pattern mining functions are quite powerful and can be used to detect

the set of customers associated with some frequent buying patterns. Use of these

functions on for example a set of insurance claims can lead to the identification of

frequently occurring sequences of medical procedures applied to patients which can help

identify good medical practices as well as to potentially detect some medical insurance

fraud.
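The following is a rough sketch of such a sequential pattern function: for each customer, the time-ordered purchases are scanned and pairs of the form "product X is followed in a later purchase by product Y" are counted. The customers and products are hypothetical, and the counting of simple pairs stands in for the richer patterns a real sequential pattern miner would find.

```python
# A minimal sketch of a sequential pattern function over time-ordered purchases.
from collections import Counter
from itertools import combinations

purchases = {                       # customer -> purchases in time order
    "c1": [{"toaster"}, {"kettle"}, {"microwave oven"}],
    "c2": [{"kettle"}, {"microwave oven"}],
    "c3": [{"toaster"}, {"blender"}],
}

pair_counts = Counter()
for orders in purchases.values():
    seen_pairs = set()
    for i, j in combinations(range(len(orders)), 2):     # order i earlier than order j
        for earlier in orders[i]:
            for later in orders[j]:
                seen_pairs.add((earlier, later))
    pair_counts.update(seen_pairs)                        # count each pair once per customer

# Products that frequently precede the purchase of a microwave oven:
for (earlier, later), count in pair_counts.items():
    if later == "microwave oven":
        print(f"{earlier} -> {later}: {count} customers")
```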

2.4 Clustering/Segmentation

Clustering and segmentation are the processes of creating a partition so that all the

members of each set of the partition are similar according to some metric. A cluster is a

set of objects grouped together because of their similarity or proximity. Objects are often

decomposed into an exhaustive and/or mutually exclusive set of clusters.

Clustering according to similarity is a very powerful technique, the key to it being

to translate some intuitive measure of similarity into a quantitative measure. When

learning is unsupervised then the system has to discover its own classes i.e. the system


clusters the data in the database. The system has to discover subsets of related objects in

the training set and then it has to find descriptions that describe each of these subsets.

There are a number of approaches for forming clusters. One approach is to form

rules which dictate membership in the same group based on the level of similarity

 between members. Another approach is to build set functions that measure some property

of partitions as functions of some parameter of the part ition.

2.5 Summary :

In this unit you studied the classification, association, sequential/temporal pattern and clustering/segmentation data mining functions. Supervised and unsupervised learning techniques play a vital role in data mining.

2.6 Question / Answer Key

1.   ________________ rule permits no exceptions so each object of LHS must be

an element of RHS

2.   ________________ pattern functions analyze a collection of records over a

 period of time

3.  A ________________ is a set of objects grouped together because of their

similarity or proximity.

Answers

1.  Exact

2.  Sequential / Temporal

3.  Cluster.


BLOCK - 3

DATA MINING TECHNIQUES

3.0 Introduction

3.1 Cluster Analysis

3.2 Induction

3.3  Neural Networks

3.4 On-line Analytical processing

3.5 Data Visualization

3.6 Summary

3.0 Introduction

Learning procedures can be classified into two categories: supervised learning and unsupervised learning. In the case of supervised learning we know the target value; the output is compared against this target and the procedure is repeated until the desired value is obtained. In the case of unsupervised learning we extract new facts without knowing the target value. In this unit you will learn different data mining techniques.

3.1 Cluster Analysis


In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc., which describe each of these subsets.

Clustering and segmentation basically partition the database so that each partition or

group is similar according to some criteria or metric. Clustering according to similarity is

a concept which appears in many disciplines. If a measure of similarity is available there

are a number of techniques for forming clusters. Membership of groups can be based on

the level of similarity between members and from this the rules of membership can be

defined. Another approach is to build set functions that measure some property of

 partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter

approach achieves what is known as optimal partitioning.

Many data mining applications make use of clustering according to similarity for

example to segment a client/customer base. Clustering according to optimization of set

functions is used in data analysis e.g. when setting insurance tariffs the customers can be

segmented according to a number of parameters and the optimal tariff segmentation

achieved.

Clustering/segmentation in databases are the processes of separating a data set into

components that reflect a consistent pattern of behavior. Once the patterns have been

established they can then be used to "deconstruct" data into more understandable subsets

and also they provide sub-groups of a population for further analysis or action which is

important when dealing with very large databases. For example a database could be used

for profile generation for target marketing where previous response to mailing campaigns

can be used to

generate a profile of people who responded and this can be used to predict response and

filter mailing lists to achieve the best response.
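As a rough illustration of clustering by similarity, the following sketch groups hypothetical customer records with a simple k-means style procedure. The choice of two numeric attributes, the value of k and the data values are assumptions made only for the example.

```python
# A minimal sketch of clustering by similarity: a k-means style grouping of
# customers described by two numeric attributes. Data values are hypothetical.
import random

customers = [(23, 15.0), (25, 18.0), (31, 22.0), (52, 80.0), (55, 95.0), (60, 88.0)]

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                  # initial cluster centres
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assign each point to the nearest centre
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, members in enumerate(clusters):             # move each centre to its cluster mean
            if members:
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
for centre, members in zip(centroids, clusters):
    print("centre", centre, "->", members)
```

Here the measure of similarity is simply squared Euclidean distance; in practice translating an intuitive notion of similarity into such a quantitative measure is the key design decision.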

3.2 Induction


A database is a store of information but more important is the information which

can be inferred from it. There are two main inference techniques available ie deduction

and induction.

Deduction is a technique to infer information that is a logical consequence of the

information in the database e.g. the join operator applied to two relational tables

where the first concerns employees and departments and the second departments and

managers infers a relation between employees and managers.

Induction has been described earlier as the technique to infer information that is

generalised from the database as in the example mentioned above to infer that each

employee has a manager. This is higher level information or knowledge in that it is a

general statement about objects in the database. The database is searched for

 patterns or regularities.

Induction has been used in the following ways within data mining.

3.2.1 Decision trees

Decision trees are a simple knowledge representation and they classify examples into a finite number of classes; the nodes are labeled with attribute names, the edges are labeled with possible values for the attribute and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, taking the edges corresponding to the values of the attributes in an object.

The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity etc. Some objects are positive examples, denoted by P, and others are negative, i.e. N. Classification is in this case the construction of a tree structure which can be used to classify all the objects correctly.
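A minimal sketch of such a weather classification, assuming the scikit-learn library is available, is given below. The attribute encodings and the handful of training objects are illustrative only.

```python
# A minimal sketch of the weather example: a decision tree classifier built from
# a handful of objects, assuming scikit-learn is installed. Data is illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

outlook = {"sunny": 0, "overcast": 1, "rain": 2}
humidity = {"high": 0, "normal": 1}
windy = {"false": 0, "true": 1}

# Each row: (outlook, humidity, windy); label P = play, N = don't play.
X = [
    [outlook["sunny"],    humidity["high"],   windy["false"]],
    [outlook["sunny"],    humidity["high"],   windy["true"]],
    [outlook["overcast"], humidity["high"],   windy["false"]],
    [outlook["rain"],     humidity["high"],   windy["false"]],
    [outlook["rain"],     humidity["normal"], windy["false"]],
    [outlook["rain"],     humidity["normal"], windy["true"]],
    [outlook["sunny"],    humidity["normal"], windy["false"]],
]
y = ["N", "N", "P", "P", "P", "N", "P"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity", "windy"]))

# Classify a new object by following a path down the tree:
print(tree.predict([[outlook["sunny"], humidity["normal"], windy["true"]]]))
```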

3.2.2 Rule induction


A data mine system has to infer a model from the database, that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple, i.e. the predicted attributes, while the remaining attributes are the predicting attributes. A class can then be defined by conditions on the attributes. When the classes are defined the system should be able to infer the rules that govern classification; in other words the system should find the description of each class.

Production rules have been widely used to represent knowledge in expert systems

and they have the advantage of being easily interpreted by human experts because of their

modularity, i.e. a single rule can be understood in isolation and doesn't need reference to other rules. The propositional-like structure of such rules has been described earlier and can be summed up as if-then rules.

3.3 Neural networks 

 Neural networks are an approach to computing that involves developing

mathematical structures with the ability to learn. The methods are the result of academic

investigations to model nervous system learning. Neural networks have the remarkable

ability to derive meaning from complicated or imprecise data and can be used to extract

 patterns and detect trends that are too complex to be noticed by either humans or other

computer techniques. A trained neural network can be thought of as an "expert" in the

category of information it has been given to analyze. This expert can then be used to

 provide projections given new situations of interest and answer "what if" questions.

 Neural networks have broad applicability to real world business problems and

have already been successfully applied in many industries. Since neural networks are best

at identifying patterns or trends in data, they are well suited for prediction or forecasting

needs including:

sales forecasting

industrial process control

customer research


data validation

risk management

target marketing etc.

Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs, which simply follow instructions in a fixed sequential order. A typical network is organised as a layer of input nodes, one or more layers of hidden nodes and a layer of output nodes, with weighted connections between successive layers.
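The following is a minimal sketch of such a back-propagation network with one hidden layer, assuming NumPy is available. The layer sizes, learning rate and the tiny XOR-style training set are illustrative assumptions, not a prescribed architecture.

```python
# A minimal sketch of a back-propagation network with one hidden layer.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # known target values

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input  -> hidden (initially random weights)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):                        # training: gradient descent on the weights
    hidden = sigmoid(X @ W1 + b1)             # forward pass
    output = sigmoid(hidden @ W2 + b2)
    error = y - output                        # compare the output with the target values
    d_out = error * output * (1 - output)     # backward pass: propagate the error
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 += 0.5 * hidden.T @ d_out;  b2 += 0.5 * d_out.sum(axis=0)
    W1 += 0.5 * X.T @ d_hid;       b1 += 0.5 * d_hid.sum(axis=0)

print(output.round(2))   # after training the outputs typically move towards [0, 1, 1, 0]
```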

3.4 On-line Analytical processing

A major issue in information processing is how to process larger and larger

databases, containing increasingly complex data, without sacrificing response time. The

client/server architecture gives organizations the opportunity to deploy specialized

servers which are optimized for handling specific data management problems. Until

recently, organizations have tried to target relational database management systems

(RDBMSs) for the complete spectrum of database applications. It is however apparentthat there are major categories of database applications which are not suitably serviced by

relational database systems. Oracle, for example, has built a totally new Media Server for

handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in

its Gain Momentum product which is designed to handle complex data such as images

and audio. Another category of applications is that of on-line analytical processing

(OLAP). OLAP was a term coined by E. F. Codd (1993) and was defined by him as "the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data".

Codd has developed rules or requirements for an OLAP system:


multidimensional conceptual view

transparency

accessibility

consistent reporting performance

client/server architecture

generic dimensionality

dynamic sparse matrix handling

multi-user support

unrestricted cross dimensional operations

intuitive data manipulation

flexible reporting

unlimited dimensions and aggregation levels

An alternative definition of OLAP has been supplied by Nigel Pendse who unlike

Codd does not mix technology prescriptions with application requirements. Pendse

defines OLAP as, Fast Analysis of Shared Multidimensional Information which means;

Fast in that users should get a response in seconds and so do not lose their chain of thought;

Analysis in that the system can provide analysis functions in an intuitive manner and

that the functions should supply business logic and statistical analysis relevant to the

users application;

Shared from the point of view of supporting multiple users concurrently;

Multidimensional as a main requirement so that the system supplies a

multidimensional conceptual view of the data including support for multiple hierarchies;

Information is the data and the derived information required by the user application.


One question is what is multidimensional data and when does it become OLAP? It is

essentially a way to build associations between dissimilar pieces of information using

 predefined business rules about the information you are using. Kirk Cruikshank of Arbor

Software has identified three components to OLAP, in an issue of UNIX News on data

warehousing;

A multidimensional database must be able to express complex business calculations

very easily. The data must be referenced and mathematics defined. In a relational

system there is no relation between line items which makes it very difficult to express

 business mathematics.

Intuitive navigation in order to `roam around' the data, which requires mining hierarchies.

Instant response, i.e. the need to give the user the information as quickly as possible.

Dimensional databases are not without problems, as they are not suited to storing all types of data, such as lists, for example customer addresses and purchase orders. Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers, in that the user is free to explore the data and receive the type of report they want without being restricted to a set format.
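As a rough illustration of such a multidimensional view, the sketch below, assuming the pandas library is available, summarizes hypothetical sales facts across region and quarter dimensions and then drills down by product. The column names and figures are assumptions made only for the example.

```python
# A minimal sketch of a multidimensional view of sales facts using pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "product": ["A", "A", "A", "B", "B", "B"],
    "amount":  [120, 150, 90, 200, 75, 60],
})

# A simple "slice and dice": total amount by region x quarter; the margins act
# as roll-ups to a higher aggregation level.
cube = pd.pivot_table(sales, values="amount", index="region", columns="quarter",
                      aggfunc="sum", margins=True, margins_name="Total")
print(cube)

# Drilling down adds the product dimension to the view:
print(pd.pivot_table(sales, values="amount", index=["region", "product"],
                     columns="quarter", aggfunc="sum"))
```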

3.5 Data Visualization

Data visualisation makes it possible for the analyst to gain a deeper, more

intuitive understanding of the data and as such can work well alongside data mining.

Data mining allows the analyst to focus on certain patterns and trends and explore in-

depth using visualisation. On its own data visualisation can be overwhelmed by the


volume of data in a database, but in conjunction with data mining it can help with exploration.

3.6 Summary

In this unit, you studied various data mining techniques. Each method has its own advantages and drawbacks. Depending on the application, one should choose the appropriate method.

BLOCK - 4


KNOWLEDGE DISCOVERY FROM DATABASE (KDD)

4.0 Introduction

4.1 view points

4.2 Classification Method

4.3 steps of a KDD process

4.4 KDD Application

4.5 Related F ields

4.6 Summary

4.7 Question/Answer key 

4.0 Introduction

We know that lots of data is being collected and warehoused, and that data is collected and stored at enormous speeds. Traditional techniques are infeasible for such raw data; hence data mining is used for data reduction.

4.1 View points.

From a commercial point of view, data mining provides better, customized services for the user; information is becoming a product in its own right. We know that traditional techniques are not suitable because of the enormity of the data, its high dimensionality and its heterogeneous, distributed nature. Hence we use prediction methods, i.e. we find human-interpretable patterns that describe the data.

Knowledge discovery in Data bases (KDD) is an emerging field that combines

techniques from machine learning, pattern recognition, statistics, Databases and

visualization to automatically extract concepts, concept interrelations and patterns of

interest from large databases. The basic task is to extract knowledge (or information)

from lower level data (databases). The basic tools used to extract patterns from data are


called data mining methods, while the overall process surrounding the usage of these tools (including pre-processing, selection and transformation of the data) and the interpretation of patterns into knowledge is the KDD process.

This extracted knowledge is subsequently used to support human decision

making. The use of KDD systems alleviates the problem of manually analyzing the large amounts of collected data which decision makers currently face. KDD systems have been implemented and are currently in use in finance, fraud detection, market data analysis, astronomy, etc. Problems in KDD include representation of the extracted knowledge, search complexity, the use of prior knowledge to improve the discovery process, controlling the discovery operation, statistical inference and selecting the most appropriate data mining method(s) to apply to a particular data set.

4.2 Classification Method :

In this approach, a collection of records (the training set) is given; each record contains a set of attributes, one of which is the class. We should then find a model for the class attribute as a function of the values of the other attributes.

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. Given a set of cases with class labels as a training set, classification is to build a model (called a classifier) to predict the class label of future data objects for which it is unknown.

Recent studies propose the extraction of a set of high quality association rules from the training data set which satisfy certain user specified frequency and confidence thresholds.

Suppose a data object


obj = {a1, a2, ..., an} follows the schema (A1, A2, ..., An), where A1, ..., An are called attributes. Attributes can be categorical or continuous. For a categorical attribute, we assume that all the possible values are mapped to a set of consecutive positive integers. For a continuous attribute, we assume that its value range is discretized into intervals and the intervals are also mapped to consecutive positive integers.

Let C = {c1, ..., cm} be a finite set of class labels. A training data set is a set of data objects such that, for each object obj, there exists a class label c_obj ∈ C associated with it. A classifier c is a function from (A1, ..., An) to C. Given a data object obj, c(obj) ∈ C returns a class label.

In general, given a training data set, the task of classification is to build a classifier from it such that it can be used to predict the class labels of unknown objects with high accuracy.
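The following is a toy sketch of this idea: class association rules that meet user-specified support and confidence thresholds are mined from a small hypothetical training set, and the highest-confidence matching rule is used to predict the class of a new object. It illustrates the approach only, and is not any particular published algorithm; the attribute names, thresholds and data are assumptions.

```python
# A toy sketch of building a classifier from class association rules that meet
# user-specified support and confidence thresholds. Training data is hypothetical.
from itertools import combinations

training = [                                   # (attribute-value pairs, class label)
    ({"age": "young", "income": "low"},  "no"),
    ({"age": "young", "income": "high"}, "yes"),
    ({"age": "old",   "income": "high"}, "yes"),
    ({"age": "old",   "income": "low"},  "no"),
    ({"age": "young", "income": "high"}, "yes"),
]

def mine_rules(data, min_support=0.2, min_confidence=0.8):
    rules = []
    items = {(a, v) for obj, _ in data for a, v in obj.items()}
    for size in (1, 2):
        for lhs in combinations(sorted(items), size):
            covered = [c for obj, c in data if all(obj.get(a) == v for a, v in lhs)]
            if not covered or len(covered) / len(data) < min_support:
                continue
            for label in set(covered):
                conf = covered.count(label) / len(covered)
                if conf >= min_confidence:
                    rules.append((dict(lhs), label, conf))
    return sorted(rules, key=lambda r: -r[2])   # highest-confidence rules first

def classify(obj, rules, default="no"):
    for lhs, label, _ in rules:                 # predict with the best matching rule
        if all(obj.get(a) == v for a, v in lhs.items()):
            return label
    return default

rules = mine_rules(training)
print(classify({"age": "old", "income": "high"}, rules))    # -> "yes"
```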

4.3 steps of a KDD process 

a)  Learning the application domain - relevant prior knowledge and goals of the application
b)  Creating a target data set - data selection
c)  Data cleaning and preprocessing - removal of noise, handling of missing or inconsistent data
d)  Data reduction and transformation - finding useful features, dimensionality / variable reduction, invariant representation
e)  Choosing the function of data mining - summarization, classification, regression
f)  Choosing the mining algorithms
g)  Data mining - search for patterns of interest
h)  Pattern evaluation and knowledge presentation - visualization, transformation, removing redundant patterns
i)  Use of discovered knowledge
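A minimal end-to-end sketch of steps (b) to (h), with each step written as a simple Python function over a hypothetical customer data set, is given below. The data, the "pattern" found and the evaluation criterion are all illustrative assumptions.

```python
# A minimal sketch of the KDD steps (b)-(h) chained as simple functions.
raw_data = [
    {"customer": "c1", "age": 23,   "spend": "120"},
    {"customer": "c2", "age": None, "spend": "85"},
    {"customer": "c3", "age": 58,   "spend": "910"},
    {"customer": "c4", "age": 61,   "spend": "870"},
]

def select(data):                       # (b) create a target data set
    return [r for r in data if r["age"] is not None]

def preprocess(data):                   # (c) cleaning: fix types
    return [{**r, "spend": float(r["spend"])} for r in data]

def transform(data):                    # (d) reduction: keep only useful features
    return [(r["age"], r["spend"]) for r in data]

def mine(data):                         # (e)-(g) a trivial "pattern" search
    old = [s for a, s in data if a >= 50]
    young = [s for a, s in data if a < 50]
    return {"avg_spend_50_plus": sum(old) / len(old),
            "avg_spend_under_50": sum(young) / len(young)}

def evaluate(pattern):                  # (h) interpretation / evaluation
    return pattern["avg_spend_50_plus"] > pattern["avg_spend_under_50"]

pattern = mine(transform(preprocess(select(raw_data))))
print(pattern, "interesting:", evaluate(pattern))        # (i) use of the knowledge
```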

The following figure shows the KDD process.


[Figure: the KDD process - raw data is selected into target data, preprocessed, transformed, mined for patterns, and the patterns are interpreted and evaluated into knowledge.]

4.4 KDD Application

The rapidly emerging field of knowledge discovery in databases (KDD) has

grown significantly in the past few years. This growth is driven by a mix of daunting

 practical needs and strong research interest. The technology for computing and storage

has enabled people to collect and store information from a wide range of sources at rates

that were, only a few years ago, considered unimaginable. Although modern database

technology enables economical storage of these large streams of data, we do not yet have

the technology to help us analyze, understand, or even visualize this stored data.

Examples of this phenomenon abound in a wide spectrum of fields: finance,

 banking, retail sales, manufacturing, monitoring and diagnosis (be it of humans or

machines), health care, marketing, and science data acquisition, among others.

Why are today's database and automated match and retrieval technologies not

adequate for addressing the analysis needs? The answer lies in the fact that the patterns to

 be searched for, and the models to be extracted, are typically subtle and require

significant specific domain knowledge. For example, consider a credit card company

wishing to analyze its recent transactions to detect fraudulent use or to use the individual

history of customers to decide on-line whether an incoming new charge is likely to be

from an unauthorized user. This is clearly not an easy classification problem to solve.


One can imagine constructing a set of selection filters that trigger a set of queries

to check if a particular customer has made similar purchases in the past, or if the amount

or the purchase location is unusual, for example. However, such a mechanism must

account for changing tastes, shifting trends, and perhaps travel or change of residence.

Such a problem is inherently probabilistic and would require a reasoning-with-

uncertainty scheme to properly handle the trade-offs between disallowing a charge and

risking a false alarm, which might result in the loss of a sale (or even a customer).

In the past, we could rely on human analysts to perform the necessary analysis.

Essentially, this meant transforming the problem into one of simply retrieving data,

displaying it to an analyst, and relying on expert knowledge to reach a decision.

However, with large databases, a simple query can easily return hundreds or thousands

(or even more) matches. Presenting the data, letting the analyst digest it, and enabling a

quick (and correct) decision becomes infeasible. Data visualization techniques can

significantly assist this process, but ultimately the reliance on the human in the loop

 becomes a major bottleneck. (Visualization works only for small sets and a small number

of variables. Hence, the problem becomes one of finding the appropriate transformations

and reductions--typica lly just as difficult as the original problem.)

Finally, there are situations where one would like to search for patterns that

humans are not well-suited to find. Typically, this involves statistical modeling, followed

 by "outlier" detection, pattern recognition over large data sets, classificat ion, or

clustering. (Outliers are data points that do not fit within a hypothesisís probabilistic

mode and hence are likely the result of interference from another process.) Most database

management systems (DBMSs) do not allow the type of access and data manipulation

that these tasks require; there are also serious computational and theoretical problems

attached to performing data modeling in high-dimensional spaces and with large amounts

of data.

4.5 Related fields 


By definition, KDD is an interdisciplinary field that brings together researchers

and practitioners from a wide variety of fields. The major related fields include statistics,

machine learning, artificial intelligence and reasoning with uncertainty, databases,

knowledge acquisition, pattern recognition, information retrieval, visualization,

intelligent agents for distributed and multimedia environments, digital libraries, and

management information systems.

The remainder of this article briefly outlines how some of these relate to the

various parts of the KDD process. I focus on the main fields and hope to clarify to the

reader the role of each of the fields and how they fit together naturally when unified

under the goals and applications of the overall KDD process. A detailed or

comprehensive coverage of how they relate to the KDD process would be too lengthy and

not very useful because ultimately one can find relations to every step from each of the

fields. The article aims to give a general review and paint with a broad brush. By no

means is this intended to be a guide to the literature, neither do I aim at being

comprehensive in any sense of the word.

Statistics. Statistics plays an important role primarily in data selection and sampling, data

mining, and evaluation of extracted knowledge steps. Historically, most statistics work

has focused on evaluation of model fit to data and on hypothesis testing. These are clearly

relevant to evaluating the results of data mining to filter the good from the bad, as well as

within the data-mining step itself in searching for, parametrizing, and fitting models to

data. On the front end, sampling schemes play an important role in selecting which data

to feed to the data-mining step. For the data-cleaning step, statistics offers techniques for

detecting "outliers," smoothing data when necessary, and estimating noise parameters. To

a lesser degree, estimation techniques for dealing with missing data are also available.

Finally, for exploratory data analysis, some techniques in clustering and design of

experiments come into play. However, the focus of research has dealt primarily with

small data sets and addressing small sample problems.


On the limitations front, work in statistics has focused mostly on theoretical

aspects of techniques and models. Thus, most work focuses on linear models, additive

Gaussian noise models, parameter estimation, and parametric methods for a fairly

restricted class of models. Search has received little emphasis, with emphasis on closed-

form analytical solutions whenever possible. While the latter is very desirable both

computationally and theoretically, in many practical situations a user might not have the

necessary background statistics knowledge (which can often be substantial) to

appropriately use and apply the methods. Furthermore, the typical approaches require an

a priori model and significant domain knowledge of the data as well as of the underlying

mathematics for proper use and interpretation. In addition, issues having to do with

interfaces to databases, dealing with massive data sets, and techniques for efficient data

management have only recently begun to receive attention in statistics.

Pattern recognition, machine learning, and artificial intelligence.   In pattern

recognition, work has historically focused on practical techniques with an appropriate

mix of rigor and formalism. The major applicable techniques fall under the category of

classification learning and clustering. Hence, most pattern-recognition work contributes

to the data-mining step in the process. Significant work in dimensionality reduction,

transformations, and projections has relevance to the corresponding step in the KDD

 process.

Within the data-mining step, pattern-recognition contributions are distinguished

from statistics by their emphasis on computational algorithms, more sophisticated data

structures, and more search, both parametric and nonparametric. Given its strong ties to

image analysis and problems in 2D signal processing, work in pattern recognition did not

emphasize algorithms for dealing with symbolic and categorical data. Classification

techniques applied to categorical data typically take the approach of mapping the data to a metric space with associated norms. Such a mapping is often not easy to formulate meaningfully: is the distance between the values "square" and "circle" for the variable shape greater than the distance between "male" and "female" for the variable sex?


Databases and data warehouses.  The relevance of the field of databases to KDD is

obvious from the name. Databases provide the necessary infrastructure to store, access,

and manipulate the raw data. With parallel and distributed database management systems,

they provide the essential layers to insulate the analysis from the extensive details of how

the data is stored and retrieved. I focus here only on the aspects of database research

relevant to the data-mining step. A strongly related term is on-line analytical processing,

which mainly concerns providing new ways of manipulating and analyzing data using

multidimensional methods. This has been primarily driven by the need to overcome

limitations posed by SQL and relational DBMS schemes for storing and accessing data.

The efficiencies achieved via relational structure and normalization can pose significant challenges to algorithms that require special access to the data: in data mining, one would need to collect statistics and counts based on various partitionings of the data, which

would require excessive joins and new tables to be generated. Supporting operations from

the data-mining perspective is an emerging research area in the database community. In

the data-mining step itself, new approaches for functional dependency analysis and

efficient methods for finding association rules directly from databases have emerged and

are starting to appear as products. In addition, classical database techniques for query

optimization and new object-oriented databases make the task of searching for patterns in

databases much more tenable.

An emerging area in databases is data warehousing, which is concerned with

schemes and methods of integrating legacy databases, on-line transaction databases, and

various nonhomogeneous RDBMSs so that they can be accessed in a uniform and easily

managed framework. Data warehousing primarily involves storage, data selection, data

cleaning, and infrastructure for updating databases once new knowledge or

representations are developed.

4.6 Summary

Knowledge discovery in databases (KDD) is an emerging field that combines techniques from machine learning, pattern recognition, statistics, and databases. In this unit you studied the steps that are involved in the KDD process. Many applications remain challenging, and a great deal of research work is ongoing today.


4.7 Question / Answer Keys

1.  The basic tools used to extract patterns from data are called _______methods.

2.  In the classification method a collection of records (training set) is given; each record contains a set of __________________ , and one of the attributes is the class.

3.   ______________ plays an important role primarily in data selection and

sampling, data mining, and evaluation of extracted knowledge steps.

Answers

1.  data mining

2.  attributes

3.  Statistics

BLOCK - 5

WEB DATA MINING


5.0 Introduction

5.1 Methods

5.2 Web content Mining

5.3 Web structure Mining

5.4 Web usage Mining

5.5 The usage mining on the web

5.6 Privacy on the web

5.7 Summary

5.8 Question and Answers key

5.0 INTRODUCTION

Web data mining is the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Today, with the tremendous growth of the data sources available on the web and the dramatic popularity of e-commerce in the business community, web data mining has become an area of increasing importance.

5.1 Methods

Web mining is a technique to discover and analyze useful information from web data. Web mining is decomposed into the following tasks (a minimal sketch of such a pipeline follows the list):

a)  Resource discovery: the task of retrieving the intended information from the web.

b)  Information extraction: automatically selecting and preprocessing specific information from the retrieved web resources.

c)  Generalization: automatically discovering general patterns at both individual web sites and across multiple sites.

d)  Analysis: analyzing the mined patterns.
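To make the decomposition concrete, the following is a minimal Python sketch of such a pipeline; the function names, the crude tag stripping and the frequent-term "generalisation" are illustrative assumptions rather than part of any standard web mining toolkit.

```python
import re
import urllib.request
from collections import Counter

def discover(urls):
    """Resource discovery: retrieve the raw HTML of the intended pages."""
    pages = {}
    for url in urls:
        with urllib.request.urlopen(url) as response:
            pages[url] = response.read().decode("utf-8", errors="ignore")
    return pages

def extract(pages):
    """Information extraction: select and preprocess the text of each page."""
    texts = {}
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)               # crude tag stripping
        texts[url] = re.findall(r"[a-z]+", text.lower())   # tokenise into words
    return texts

def generalise(texts):
    """Generalisation: discover simple patterns (here, the most frequent terms per page)."""
    return {url: Counter(tokens).most_common(5) for url, tokens in texts.items()}

def analyse(patterns):
    """Analysis: inspect the mined patterns."""
    for url, top_terms in patterns.items():
        print(url, top_terms)

# Demonstration with a canned page so the sketch runs without network access;
# in practice the pages would come from discover([...]) instead.
sample_pages = {"http://example.org": "<html><body>data mining of web data</body></html>"}
analyse(generalise(extract(sample_pages)))
```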


5.2 Web Content Mining

Web content mining describes the automatic search of information resources

available online and involves mining web data contents. In the web mining domain, web

content mining essentially is an analog of data mining techniques for relational data

 bases, since it is possible to find similar types of knowledge from the unstructured data

residing in web documents. The web document usually contains several types of data,

such as text, image, audio, video, meta data and hyperlinks. Some of them are semi  –  

structured such as HTML documents or a more structured data like the data in the tables

or database generated HTML pages, but most of the data is unstructured text data. The

unstructured characteristic of web data force the web content mining towards a more

complicated approach.

Web content mining is often based on statistics about single words in isolation: unstructured text is represented by taking each single word found in the training corpus as a feature.
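As a minimal illustration of this single-word (bag-of-words) representation, the sketch below builds a vocabulary from an invented two-document training corpus and turns a document into a term-frequency vector; the corpus and names are hypothetical.

```python
from collections import Counter

# Invented training corpus: in practice this would be text extracted from web pages.
training_corpus = [
    "data mining extracts patterns from large data",
    "web content mining analyses text image and video data",
]

# Vocabulary: every single word seen in the training corpus becomes a feature.
vocabulary = sorted({word for doc in training_corpus for word in doc.split()})

def to_feature_vector(document):
    """Represent an unstructured document as term frequencies over the vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(to_feature_vector("web usage mining uses web log data"))
```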

Multimedia data mining is part of the content mining, which is engaged to mine

the high-level information and knowledge from large online multimedia sources.

Multimedia data mining on the web has gained many researchers' attention recently. Working towards a unifying framework for representation, problem solving and learning from multimedia is a real challenge; this research area is still in its infancy, and much work remains to be done.

5.3 Web structure Mining

Most web information retrieval tools use only the textual information, while ignoring the link information that could be very valuable. The goal of web structure mining is to generate a structural summary about the web site and web pages.


Technically, web content mining mainly focuses on the structure within individual documents, while web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web structure mining categorizes web pages and generates information such as the similarity and relationships between different web sites.

If a web page is linked to another web page directly, or the web pages are neighbors, we would like to discover the relationships among those web pages. The relationships may fall into one of several types: the pages may be related by synonymy or antonymy, they may have similar contents, or both of them may sit on the same web server and therefore have been created by the same person. Another task of web structure mining is to discover the nature of the hierarchy or network of hyperlinks in the web sites of a particular domain. This may help to generalize the flow of information in web sites representing a particular domain, so that query processing becomes easier and more efficient.
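A hedged sketch of what hyperlink-based structure mining can look like: a tiny, invented link graph is summarised by counting in-links as a crude importance measure (real systems use more elaborate measures such as PageRank).

```python
# Hypothetical hyperlink graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# In-link counts serve here as a crude structural summary of page importance.
in_links = {page: 0 for page in links}
for source, targets in links.items():
    for target in targets:
        in_links[target] = in_links.get(target, 0) + 1

# Pages sorted by how many other pages link to them.
print(sorted(in_links.items(), key=lambda item: item[1], reverse=True))
```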

5.4 Web usage Mining

Web usage mining tries to discover useful information from the secondary data derived from the interactions of users while surfing the web. It focuses on techniques that can predict user behavior while the user interacts with the web. In the data preparation process of web usage mining, the web content and web site topology are used as information sources, which connects web usage mining with web content mining and web structure mining. Clustering in the pattern discovery process is a further bridge from usage mining to content and structure mining.

5.5 The Usage Mining on the Web

Web usage mining is the application of data mining techniques to discover usage

patterns from web data, in order to understand and better serve the needs of web-based applications.


Web usage mining is divided into three distinct phases: preprocessing, pattern discovery, and pattern analysis.

Preprocessing : Web usage mining is the application of data mining techniques to usage logs (secondary web data) of large web data repositories. Its purpose is to produce results that can be used in design tasks such as web site design, web server design and navigation through a web site. Before applying the data mining algorithms, we must perform data preparation to convert the raw data into the data abstraction necessary for further processing.

Pattern discovery : Pattern discovery draws together algorithms and techniques from several research areas, such as data mining, machine learning, statistics and pattern recognition.

Pattern Analysis : Pattern analysis is the final stage of web usage mining. The goal of this process is to eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. There are two common approaches to pattern analysis: one is to use a knowledge query mechanism such as SQL, while the other is to construct a multidimensional data cube before performing OLAP operations.
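A minimal sketch of the three phases, assuming a toy list of (user, page) log records: preprocessing groups the raw log into per-user sessions, pattern discovery counts page pairs that co-occur in a session, and pattern analysis keeps only pairs above an assumed support threshold.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical raw usage log: (user, page) records in time order.
raw_log = [
    ("u1", "home"), ("u1", "products"), ("u1", "cart"),
    ("u2", "home"), ("u2", "products"),
    ("u3", "home"), ("u3", "cart"),
]

# Preprocessing: convert the raw log into one session (list of pages) per user.
sessions = defaultdict(list)
for user, page in raw_log:
    sessions[user].append(page)

# Pattern discovery: count page pairs that occur together in a session.
pair_counts = Counter()
for pages in sessions.values():
    for pair in combinations(sorted(set(pages)), 2):
        pair_counts[pair] += 1

# Pattern analysis: keep only pairs supported by at least two sessions (assumed threshold).
interesting = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(interesting)
```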

5.6 Privacy on the Web

Due to the massive growth of e-commerce, privacy has become a sensitive topic and has attracted more and more attention recently. The basic goal of web mining is to extract information from data sets for business needs, which makes its applications highly customer-related. The lack of regulation in the use and deployment of web mining systems, and the widespread reports of privacy abuses related to data mining, have made privacy a hotter issue than ever before. Privacy touches a central nerve with people and there are no easy solutions.


5.7 Summary 

In this unit, you studied the area of web data mining with a focus on web usage mining. Web usage mining involves three stages: preprocessing, pattern discovery and pattern analysis.

5.8 Question/Answer Keys

1.  _________________ is the task of retrieving the intended information from the web.

2.  Web content mining describes the ____________________ of information resources available online and involves mining web data contents.

3.  Web content mining mainly focuses on the structure within individual documents, while web structure mining tries to discover the _________ of the hyperlinks at the inter-document level.

Answers

1.  Resource discovery

2.  automatic search

3.  link structure


UNIT III

Unit Introduction

Having learnt the fundamentals of warehousing, in this unit, we list out the

various areas in which warehousing becomes useful, especially in the central and state

government sectors.

You will also be introduced to a case study: that of the Andhra Pradesh information warehouse. This case study is expected to underline the various concepts discussed in the previous unit and provide a practical bias to the entire concept of data warehousing.

A term project is suggested to further drive home the complexities and intricacies

involved in the process.

Since this unit is to be studied in totality, no summary or review questions are

included.


BLOCK I

Block Introduction

In this block you will be briefly introduced to the various possible applications of the data warehouse concept. Because of familiarity, the present and suggested applications of the warehousing technique at the government level have been briefly described. This block is expected to give you some insight into the practical applications of the data warehouse.

Contents:

1.  Areas of applications of Data warehousing

2.  Data warehousing technologies in the government

3. 

Government of India warehouses

i.  Data warehousing of census data

ii.  Monitoring of essential commodities

iii.  Ministry of commerce

iv.  Ministry of Education


1. AREAS OF APPLICATION OF DATA WAREHOUSING & DATA MINING

Having seen so much about data mining and data warehousing, the question arises as to their areas of application. Of course, the business community was the first user of the technique: businesses feed in their own and their competitors' results, trends etc. and come out with tangible strategies for the future. Obviously, there can be as many variations and modifications of the warehousing concept as there are types of business. However, to learn about them, one should first know the types of business practices, their various strategies etc. before one can appreciate the warehousing techniques. Instead, in this block, we choose the safer option of going through the various applications at the government level. For two reasons, this promises to be a good procedure: first, all of us have an idea, to some extent, of how the government machinery works; secondly, a lot of literature is available about the implementations. However, we should underline the fact that we will be more concerned with the techniques and technologies than with the actual outputs and results.

2. DATA WAREHOUSING TECHNOLOGIES IN THE GOVERNMENT

It is obvious that in a large country like India, data mining and data warehousing technologies have extensive potential for use in a variety of activities in several central government sectors like agriculture, commerce, rural development, health, tourism and so on. In fact, even before the advent of data warehousing technologies, there were attempts to computerize the available data and use it to facilitate the decision making process. Over the last decade, there have been several attempts to work more methodically with the available data.

Similarly, several state governments, especially those that are forerunners in the IT industry, have tried to exploit the technology in various areas. Needless to say, a lot

more needs to be done than what has been achieved. In the next sections, we briefly list

the various areas that have been identified for the data warehousing applications. In the

next block, we see a detailed case study of the Andhra Pradesh Information warehouse,

which should give the learners a grasp over the concepts we have studied earlier.


3. GOVERNMENT OF INDIA WAREHOUSES:

i)  A Data warehouse of census data

The government of India conducts a census of the population of the country, which serves as a storehouse of all types of information for the country's planning process. Though the census records are presently processed manually or with conventional database technologies, the data available in them is so varied and complex that it is ideally suited for data warehousing techniques. Information about wide-ranging areas at various levels (village, district, state etc.) can be extracted and compiled using OLAP techniques.

In fact, a village level census analysis software has been developed by the

 National Informatics Centre (NIC). This software gives details in two parts : primary

census abstract and the details about the various amenities. This software has been used

on a trial basis to get the various views of the development scenario of selected villages

in the country, using the 1991 census data. Efforts are on to use the technology on a much larger scale with subsequent census data.

It is easy to see why the census data is ideally suited for data warehousing applications. Firstly, it is reasonably static data, updated only once in ten years. Secondly, since unbelievably large volumes of complex data are available, the benefit of the technology over other methods of extracting information is obvious even at first sight. Thirdly, almost all the concepts of data warehousing are applicable in this application.

ii) Monitoring of essential commodities

The government of India compiles data on the prices of essential commodities like rice, wheat, pulses, edible oils etc. At every week end, the prices of these commodities on every day of the previous week are collected from selected centres across the country. These are consolidated to give the government an insight into the trends and also allow the government to devise strategies on various agricultural policies. Again, because of the geographical spread, the periodicity of updating etc., this becomes


an ideal case for OLAP technology application - especially because of the network facility available on a countrywide basis.

iii) Ministry of commerce

The ministry of commerce has to constantly monitor the prices, quantum of exports, imports and stock levels of several commodities to take appropriate steps to boost exports and also to devise an EXIM (Export-Import) policy that suits the country's industries in both the short and long terms. It should constantly take into account

the various trends ( both at the national and international levels) so that our exports

continue to be competitive in the global markets, at the same time ensuring that our

industries are not swamped by foreign goods. The situation became more complex after

the opening up of our economy to global influences in 1991.

To ensure this, the ministry of commerce has setup several export processing

zones across the country, which compile data about the various export-import

transactions in their selected regions. These are then compiled to produce data for

decision making to the commerce ministry on a regular basis.

This being again a fit case for a data warehousing operation, the government has drawn up a plan to make use of OLAP decision support tools for decision making. In fact

the data collection centres are already computerised, and in the second phase of

computerisation, the decision making process is expected to be based on the principles of

data mining and warehousing concepts.

iv) Ministry of Education

The latest all India education survey, which has given rise to a treasure house of

valuable data about the status of education across the country, has been converted into a

data warehouse. This is supporting various decision making queries. In addition, several

other departments are ideally suited to make use of data warehousing and data mining

technologies. Some of them have already initiated action in this direction as well. To list

a few of them -


i)  The ministry of rural development: Detailed surveys on the availability of

drinking water, number of people below the poverty line, available surplus

land for distribution etc.. have been computerised at various stages in the

last decade. A consolidation of these into a warehouse is being

contemplated.

ii)  The ministry of tourism has already collected valuable data regarding the

 pattern of tourist arrivals, their choices and spending patterns etc. Details

about primary tourist spots are also available. They can be combined to

 produce a data warehouse to support decision making.

iii)   The ministry of Agriculture conducts an agriculture census, based on

remote sensing, random sampling etc to compile data about the cropping

 patterns, expected yields, input of seeds and fertilizers, livestock data etc.

Also areas under irrigation, rainfall patterns, forecasts etc.. are routinely

compiled. These can be combined into a data warehouse to aid decision

making.

In addition, several areas like planning, health, economic affairs etc. are ideally suited to make use of OLAP tools. Conventionally, many of these departments are computerised, routinely produce MIS reports and hence maintain medium to large size databases. The next logical step is to convert these databases and MIS know-how into full-fledged data warehouses. This would

result in a paradigm shift, as far as data utilisation is concerned. Since the utility

of most types of data is time bound, enormous delays in extracting information

out of them would make the information time barred and hence of little use.

Further, such warehouses, when they come into existence, would release the

expert manpower now spent on processing the data for data analysis and decision

making.

The next stage, obviously, is to link these departmental warehouses. It is

obvious, even to an outsider, that most of these departments cannot work in isolation. Hence, unless the departments can avail themselves of selected data from the


warehouses of other departments, their decision making remains incomplete.

Ofcourse, a lot of checks and balances need to be put in place before such huge,

multidimensional, warehouses are made functional. But the goal should be to

have a consolidated central government warehouse and corresponding state

government warehouses.


Block II

Data Warehouse  –  A case study

You will be introduced to a practical data warehouse design case - that of the Andhra Pradesh Information Warehouse. The various stages of the development are described in the context of what has already been discussed in the previous sections, and the various tradeoffs involved are discussed in as much detail as possible. At the end of the block, you are expected to have become more comfortable with the practical aspects of warehousing techniques.

Contents:

1. 

Introduction

2.  Concepts used in Developing the warehouse.

3.  Data Sources

a)  MPHS

 b)  Land Suite applications

c)  Maps and dictionaries

4.  Possible users of information

i.  Policy planners

ii.  Custodian

iii.  Warehouse developers

iv.  Citizens

5.  Conversion of data to information

i.  Data conversion

ii.  Data scrubbing

iii.  Data transformation

iv. 

Web publishing

6.  Identifying hardware and software

7.  Type of expected queries

8.  Choice of data structures and dimensions

9.  Term Exercises


1. INTRODUCTION

In this block, we look in detail at the process of development of the Andhra Pradesh Information Warehouse. As specified earlier, we will be more interested in the technological and technical aspects than in the administrative details.

The Andhra Pradesh government has undertaken a project to develop a state data warehouse, with the 'person' identified as the smallest entity of the data repository. Put the other way, the state government extracted information from its Multipurpose Household Survey (MPHS) and its computerised land records. The idea was to link the 'land' and 'people' entities to produce a conceptually clean data warehouse.

This data warehouse is expected to provide planners with sufficient inputs to assess the impact of their various welfare schemes on various sections of society. It is possible for them to choose different target groups like urban slum dwellers, industrial workers, agricultural labourers etc. and review their status with reference to various parameters like economy, education, housing, health etc. The data so generated can be used for planning schemes specifically targeted towards any one or more of these groups. By a logical extension of the concept, it should be possible for the policy makers to assess the impact of their welfare programs on these target groups during the progress of the programs. Such a scenario is expected to help the policy planners and executives keep their decisions purposeful and focussed.

The actual warehouse was developed by C-DAC (Centre for Development

of Advanced Computing). In the next few sections, we see the basic concepts and the schema used by them.


2. THE CONCEPTS USED IN DEVELOPING THE WAREHOUSE:

These concepts have been discussed in detail in the earlier unit, but have

 been included here to make the case study self contained and also to serve as a

ready reckoner.

The type of processing is typical data warehouse processing. The data is stored in the form of tables (relations), and can be accessed based on keys. Some queries are taken up by the OnLine Analytical Processing (OLAP) system, which is designed as a multidimensional database (or a collection of them) on which the user can run complex analytical queries. The databases are normally optimised based on the previously known patterns of data entry and data retrieval.

Drill down and Rollup analysis: Data available in the database is normally

arranged in several layers. The upper layers contain single data entities and their

details are hidden in the lower levels, each successive layer having detailed data

about the entities above it and the details of the present layer hidden in the next

lower layers. It is for the user to decide at what level he wants to see the data.

The process of beginning at a higher level and viewing data at the progressively

lower levels is called " drilling down" on the data. Conversely, one can view

data beginning at a detailed (lower) level and move up to concise (higher) levels.

This is called "rolling up" of data.

With this terminology in place, we go about "designing" the data warehouse. Though several alternative methods are possible, and the way C-DAC went about the actual design cannot be duplicated here, this exercise is intended to give an idea of the process of development in a nutshell.


 Now we see the various stages of development as follows:

1)  Identify the data sources and the type of data available from them

2)  Identify the users of the warehouse and the type of queries you can expect

from them

3)  Identify the methods of converting the data sources (1) to data users (2)

above

4)  Identify the hardware and software components

5)  Finalise the type of queries that arise and ways of combining / standardising

them.

6)  Look at the ways of storing data in such a format so that it can be efficiently

searched by most of the queries.

7)  Finalise the data structures, analysis variables and methods of a calculation.

3. THE DATA SOURCES:

The Andhra Pradesh government basically decided to link the data entities "Land"

and "Person" and build the warehouse. Hence, the primary sources of data were the land

records ( which had already been computerised) and the person-related data were

collected from the Multi-Purpose Household Survey(MPHS) suite of applications.

a)  MPHS: The government of Andhra Pradesh collected data from each household regarding the socio-economic status of each family. This data, collected

originally for a different purpose, was available as MPHS suite of applications

in an electronic format. Relevant portions of this suite were made use of by

the government for building the warehouse.

b)  Land suite of applications:  This data, again, was already available, in which

land was the core entity of information. Again, relevant portions of these

records, were used for constructing the warehouse.


c)  Maps and dictionaries: Since the number of entities entering any reasonably

useful database is very large, codes, instead of names, are normally used. Dictionaries are maintained that relate the names to unique codes. Of course, the codes are allotted depending on the entities and their applications. These dictionaries are maintained by different "custodians". Depending on whether the data is land related, person related, social or educational, different custodians allot and maintain these codes. Needless to say, the use of codes greatly simplifies the handling of the entities.

However, depending on the areas of applications, each entity is allotted a different

code. For example, the school building may be given a different code, depending on

whether the dictionary pertains to educational, land use or social aspect. Thus, there

should be a mapping between the various dictionaries based on these different

classification schemes. Thus maps are maintained to interrelate one set of codes to another, their validity being checked and updated at regular intervals. Again, only authorised custodians are allowed to maintain and modify such maps.

These dictionaries and maps are essential to store and manipulate the data objects.
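A minimal sketch, with invented codes, of how such dictionaries and inter-dictionary maps might be represented: each custodian's dictionary relates codes to names, and a map relates a code in one classification scheme to the corresponding code in another.

```python
# Hypothetical dictionaries maintained by different custodians.
education_dict = {"E017": "Government High School, Village V1"}
land_use_dict  = {"L342": "Public building on survey no. 17"}

# Map maintained to interrelate the two coding schemes.
education_to_land_use = {"E017": "L342"}

def land_use_name(education_code):
    """Resolve an education-dictionary code to its land-use description."""
    return land_use_dict[education_to_land_use[education_code]]

print(land_use_name("E017"))
```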

4. POSSIBLE USERS OF INFORMATION

The next phase is to identify the users of the information that the warehouse

generates. We briefly discuss the proposed users of the information in the present case.

i) Policy Planners : These are the primary users of the information generated from the

warehouse. Since they are expected to use it to the maximum extent, the warehouse queries need to be optimised to suit the type and pattern of queries generated by them. Though it may not be possible to anticipate their queries fully, a reasonable guess about the type of conclusions and decisions they would like to draw can be ascertained, possibly through interviews and questionnaires. Also, since they are likely to be distributed all over the state and maybe even outside, the warehouse


needs to be web-enabled. They should also be able to copy sections of information (data cubes) onto their own machines and operate on them; since most of them are not likely to be computer professionals, the entire operation should be seamless and transparent. All these factors should be taken into account while finalising the optimisation parameters.

ii) Custodians : As seen earlier, the dictionaries and maps are maintained by custodians.

In addition, the object entities themselves need to be maintained by custodians. All these

dictionary custodians, map custodians and entity custodians will be responsible for

maintaining the entities and also for incorporating changes from time to time. For

example, the way the government treats a particular caste (SC/ST/backward), or a village

(backward/ forward/sensitive) or even persons may change from time to time. All that

government does is to issue a notification to the same effect. The concerned custodians

will be responsible for maintaining the validity of the entities of the warehouse.

However, again, they are not likely to be computer professionals (at least the map and dictionary custodians), and hence they should be able to view the entities in the way they are accustomed to and be able to manage them.

iii) Warehouse developers, administrators and database administrators: These are the persons who are actually responsible for the day-to-day working of the warehouse. They will be able to look at the repository from the practical point of view and decide about its capabilities and limitations. Their views are most sought after in deciding the viability or otherwise of the warehouse.

iv) Citizens: The Government plans to make certain categories of data available to

ordinary citizens on the web. Since their background, type of information they are

looking for and their abilities to interact are not homogeneous, generalised assumptions

are to be made about their needs and suitable queries made available.


5. CONVERSION OF DATA TO INFORMATION:  

Once the sources and users are identified, methods of converting raw data into

useful information are to be explored. Needless to say, this is the key to the success of

the warehouse.

 Normal methods employed are

i.  Data Conversion

ii.  Data Scrubbing

iii.  Data Transformation

iv.  Web Publishing

i.  Data Conversion: The different inputs to the warehouse come from various data capture systems - online, disks, tapes etc. Such information, coming from different OLTP systems, needs to be accepted and converted into suitable formats before loading onto the warehouse (called the core object repository). Standard software like Oracle SQL loaders can do the job.

Once the data becomes available on tape, floppy or any other input form, the warehouse manager checks its authenticity and then executes the routines to store it in the warehouse. He may even take printouts of the same. Barring the warehouse manager, other users/custodians are not allowed to modify the data in the warehouse. They can only send the data to be updated to the manager, who will make the necessary updates. Typically, the data in the warehouse becomes unavailable to all or a set of users during such updates.

ii.  Data Scrubbing: Data scrubbing is the process of checking the validity of the data arriving at the data warehouse from different sources, to ensure its quality, accuracy and completeness. Since the data originates from different sources, it is possible that some of the key data may be ambiguous, incomplete or missing altogether. Further, since the data keeps arriving at periodic intervals, its consistency with respect to the previously stored data is not always guaranteed. Such invalid data, needless to say, leads to false comparisons.


Further, over a period of time, simple inconsistencies like misspelt names,

missing fields, inconsistent data (like place name and PIN) may accrue. No single method is available for dealing with all such shortcomings. Several algorithms,

ranging from simple to fairly complex ones, are used to filter out such

inconsistencies. In extreme cases, the sources of data may have to be requested

for resubmission.
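A hedged sketch of the kind of simple scrubbing rules described above, applied to invented records: it flags a missing key field and an inconsistent place-name/PIN combination; real warehouses use far larger rule sets.

```python
# Invented reference data: which PIN codes are valid for a place name.
valid_pins = {"Mysore": {"570001", "570002"}}

records = [
    {"name": "Ravi", "place": "Mysore", "pin": "570001"},
    {"name": "",     "place": "Mysore", "pin": "570001"},   # missing name
    {"name": "Sita", "place": "Mysore", "pin": "560001"},   # place/PIN mismatch
]

def scrub(record):
    """Return a list of problems found in one record (an empty list means clean)."""
    problems = []
    if not record["name"]:
        problems.append("missing name")
    if record["pin"] not in valid_pins.get(record["place"], set()):
        problems.append("place and PIN are inconsistent")
    return problems

for record in records:
    print(record["name"] or "<blank>", scrub(record))
```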

iii.  Data Transformation: This process involves the extraction of data from the information repository, scrubbing it and loading it into the main database. The process includes identifying the dimensionalities, storing the data in appropriate formats, and may also involve indicating to the users that the data is ready for use.

iv) Web Publishing: This becomes important if the warehouse is to be web

enabled. The web agent on the server, which interacts with the HTML templates, reads

the data from the server and sends it on the web page. The agent, of course, has to

resolve the access rights of the user before populating the information on the web page. This becomes extremely important when, for example, citizens are allowed to access certain sections of information, while many others are to be made inaccessible. The system administrator is expected to tackle the various issues regarding such selective access rights by suitably configuring the server.
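The access-rights check the web agent must perform could, for instance, look like the following sketch; the roles and the division into public and restricted sections are assumptions made purely for illustration.

```python
# Hypothetical mapping of user roles to the data sections they may see.
access_rights = {
    "citizen": {"public_summary"},
    "planner": {"public_summary", "income_detail", "land_detail"},
}

warehouse_sections = {
    "public_summary": "Literacy rate by district ...",
    "income_detail":  "Average household income by village ...",
    "land_detail":    "Land holdings by khata number ...",
}

def publish(role, section):
    """Return the section content only if the role is allowed to see it."""
    if section in access_rights.get(role, set()):
        return warehouse_sections[section]
    return "Access denied"

print(publish("citizen", "income_detail"))  # Access denied
print(publish("planner", "income_detail"))
```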

6. IDENTIFYING THE HARDWARE AND SOFTWARE COMPONENTS:

While no specific guidelines regarding the hardware components can be given, it is desirable to store the data from external data sources separately, at least during the data conversion stage. In the present case, the two data sources, the multipurpose

household survey (MPHS) and the land data extracted from land records can be stored on

two separate sets of storage devices.

The data, after the scrubbing operation, is normally stored on an RDBMS, like Oracle or Sybase. Usually these are relational databases, from which the multidimensional database server (MDDB) receives data. In the present case Oracle8

Enterprise Edition was deployed, because it supports both relational and object relational

models.


The next important component is the multidimensional database server (MDDB).

This is a specialised data storage facility to store summarised data for fast access. Since,

unlike the two dimensional relational DBMS, this operates on a multidimensional logical

 perspective, traversal along one/more of these dimensions either successively or in

 parallel becomes easier and also faster.

Of course the choice of number of dimensions and their actual relationships forms

an important design strategy. Too few dimensions can make the operational efficiency similar to that of simple relational models, whereas a choice of too many dimensions would make the operation complex (since physically, the server and RDBMS still work as two-dimensional operators, the multidimensional operation being only a logical extension).

The concept can be extended to several levels. A given dimension can be a simple one, or can itself be made up of several dimensions. For example, the concept of person can be made up of the dimensions sex, age group, occupation, caste and income. The dimension age group may have sub-dimensions along the actual age, and income may have sub-dimensions like assured income and non-assured income etc. The consolidation of data into several dimensions is a tricky job. Often data is collected at the lowest level and is aggregated into higher level totals for the sake of analysis.

Since the data is to be accessed on the web, a web server of suitable capacity is of

prime importance. The web server receives query requests from the web, converts them into suitable queries, hands them to the MDDB server, and the replies from the MDDB server are

sent back to the web, to be displayed to the person who has raised the query request.

The query itself can be raised either by i) The clients which are computers

connected physically to the servers or ii) web clients, where the user would require the

replies over the Internet. The government may also provide "kiosks", special terminals, where users can get the required information through 'touch screen' technology.


7. TYPE OF EXPECTED QUERIES  

The usually encountered types of queries are to be listed out next. This can easily be done by interacting with the potential users and noting their expectations of the warehouse.

For example, in the present case, the person object may be used to answer questions

1.  Relationship amongst members

2.  Educational levels of persons

3.   No. of persons above poverty line

4.  Average income per household

5.  Percentage of persons owning houses etc. etc.

Similarly on the land object, questions normally asked are

1.  Land under cultivation

2.  Percentage of irrigated land

3.  Percentage of crops in the land

4.  Average land holdings of person

5.  Yield seasonwise etc. etc.

While users like the planners are likely to come out with special and newer queries, the average citizens often end up asking similar questions. This, apart from the fact that many of them may not be computer savvy, makes a case for producing several "canned query modules", i.e. the user has no option of formulating his own queries, but can choose to get answers to one or more of the readymade questions. Such questions can be placed on the "kiosks", and the user gets the answer by choosing them using a suitable pointer device.

At the next level, the user may be provided with a "Custom Query

Model", which helps him to formulate queries and get the answers. It may also

help the user to change certain parameters and get suitable results that help in

formulating policies. Further, such custom queries may be either summary ones

or detailed. The latter help the users in microlevel analysis of information.


8. CHOICE OF DATA STRUCTURES AND DIMENSIONS

The next stage is the choice of data structures and dimensions. Dimensions are a

series of values that provide information at various levels in the hierarchy. For

example in this particular case, the item person has been given 20 dimensions and

their dimension numbers are listed below

1.  Occupation (D1)

2.  Age (D2)

3.  Sex (D3)

4.  Caste (D4)

5.  Religion (D5)

6. 

Shelter (D6)

7.  SSID (D7)

8.  House (D8)

9.  Khata number (D9)

10.  Crop (D10)

11. Season (D11)

12.  Nature (D12)

13. Irrigation (D13)

14.  Classificat ion (D14)

15.  Serial Number (D15)

16.  Land (D16)

17. Area (D17)

18.  Time (D18)

19. Occupant (D19)

20.  Marital Status (D20)

Of course there was no particular reason for this ordering of the dimensions, and any other order of dimensions would have been equally viable. Note that each of these dimensions can be considered to be at level 1, but they can have lower level values at level 2, level 3 etc.


For example, the occupation dimension may be pictured as a two-level hierarchy:

Level 1: All Occupations
Level 2: Occupation 1, Occupation 2, ...

Now if someone is searching for a person with a specific occupation (say occupation 2), then one will search along the occupation dimension (D1) at level 2, and so on.

Now consider another case, the case of castes, pictured as a three-level hierarchy:

Level 1: All Castes
Level 2: Forward Caste, Backward Caste, Scheduled Caste
Level 3: the individual castes under each of these groups

Now all castes lie along dimension D4. If one needs some details about all castes, then one searches along D4 at level 1. If details about backward castes are needed, one goes along D4 at level 2. In the case of some particular caste, say caste 3, the search will be along level 3 of D4.


Take the case of area (D17). Look at the hierarchy:

Level 1: State
Level 2: District 1, District 2
Level 3: Taluk 1, Taluk 2, Taluk 3, Taluk 4
Level 4: Village 1, Village 2

Now any search along D17 for specific taluks will proceed along level 3, for a specific district at level 2, and so on.
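Such level-wise hierarchies might, for instance, be represented as nested mappings; the sketch below encodes the area dimension (D17) and returns all members found at a requested level. The structure and names are illustrative only.

```python
# Hypothetical area hierarchy for dimension D17.
area_hierarchy = {
    "State": {
        "District 1": {"Taluk 1": ["Village 1", "Village 2"], "Taluk 2": []},
        "District 2": {"Taluk 3": [], "Taluk 4": []},
    }
}

def members_at_level(tree, level, current=1):
    """Collect every member of the hierarchy found at the requested level."""
    if level == current:
        return list(tree)  # dict keys or list elements at this level
    members = []
    if isinstance(tree, dict):
        for subtree in tree.values():
            members.extend(members_at_level(subtree, level, current + 1))
    return members

print(members_at_level(area_hierarchy, 1))  # ['State']
print(members_at_level(area_hierarchy, 3))  # all taluks
```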

Once the above structures are frozen, the analysis becomes simple. Most database packages provide i) specific queries to search along specific levels of a dimension in a truly multidimensional database, while ii) in a simple relational database, the multiple dimensions need to be searched as relations at the appropriate level. Since this analysis part is software specific, it does not come under the purview of this case study, but it suffices to say that any general query can be broken into a sequence of search commands at the appropriate levels.

Canned query modules are simply a list of such sequential query combinations, each combination answering a particular 'canned query' and identified, possibly, by a number. Once the number is selected, the sequence of searches is made and the results displayed.

In the other case of "custom query modules", the GUI helps the user to convert his

queries into a sequence of system queries, so that they can be implemented.
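A hedged sketch of a canned query module: each numbered query maps to a fixed sequence of (dimension, level, filter) searches that is executed when the user picks the number; the search routine and the query contents are hypothetical.

```python
# Each canned query is a fixed sequence of (dimension, level, filter) searches.
canned_queries = {
    1: [("D4", 2, "Backward Caste"), ("D1", 2, "Agricultural labour")],
    2: [("D17", 2, "District 1"), ("D13", 1, None)],
}

def search(dimension, level, value):
    """Placeholder for the real search routine against the warehouse."""
    return f"rows matching {dimension} at level {level} with value {value}"

def run_canned_query(number):
    """Execute the stored sequence of searches for the chosen query number."""
    return [search(*step) for step in canned_queries[number]]

for line in run_canned_query(1):
    print(line)
```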

While the above discussion provides a basic structure for the implementation,

several details like handling of historic data, providing time-dimensioned reports etc.



have been left out here. But it suffices to say that those details will be add-ons to the basic analysis package.

9. TERM EXERCISE.

Suggest a suitable data warehouse design to maintain the various details of your college. While the actual query formulations are not very important, the various system requirements need to be worked out in detail and presented in a step-by-step manner.