Data mining Introduction

Post on 04-Nov-2014

description

Data mining and data warehousing

transcript

Evolution of data mining and warehousing, by year:

1960s: Data collection and database creation
1970s: Database management systems
Mid 1980s: Advanced database systems
Late 1980s: Data warehousing and data mining
1990s: Web-based databases
2006: Information systems
2013: Big data retrieval

Data mining refers to extracting or “mining” knowledge from large amounts of data.

Other terms for data mining:

Knowledge mining from data
Knowledge extraction
Data/pattern analysis
Data archaeology
Data dredging
Knowledge discovery from data

Knowledge Discovery Process:

1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
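The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the frequency-count "mining" step are assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"customer": "c1", "item": "Milk "}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": "Bread"}]

def clean(records):
    """Data cleaning: drop records whose item is missing."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the task."""
    return [r["item"] for r in records]

def transform(items):
    """Data transformation: normalize case and whitespace."""
    return [item.strip().lower() for item in items]

def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count=2):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{item}: seen {n} times" for item, n in sorted(patterns.items())]

integrated = integrate(clean(store_a), clean(store_b))
report = present(evaluate(mine(transform(select(integrated)))))
print(report)
```

In a real system each step would be far more involved; the point here is only the order in which the steps feed one another.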

Kinds of data to be mined:

Relational databases
Data warehouses
Transactional databases
Object-relational databases
Temporal, sequence, and time-series databases
Spatial and spatiotemporal databases
Text and multimedia databases
Heterogeneous and legacy databases
Data streams and the WWW

1. Relational database

In an object-relational database, each entity (object) is associated with:

A set of variables
A set of messages
A set of methods

A temporal database typically stores relational data that include time-related attributes.

These attributes may involve several timestamps, each having different semantics.

A sequence database stores sequences of ordered events, with or without a concrete notion of time.

Examples include customer shopping sequences, Web click streams, and biological sequences.

A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly).

Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).
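As a small illustration of time-series data, the sketch below stores timestamped temperature readings and aggregates them per day. The readings and the `daily_mean` helper are made up for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical temperature readings: (timestamp, degrees Celsius).
readings = [
    (datetime(2014, 11, 3, 9), 20.0),
    (datetime(2014, 11, 3, 15), 24.0),
    (datetime(2014, 11, 4, 9), 18.0),
    (datetime(2014, 11, 4, 15), 22.0),
]

def daily_mean(series):
    """Group readings by calendar day and average each group."""
    by_day = defaultdict(list)
    for timestamp, value in series:
        by_day[timestamp.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

daily = daily_mean(readings)
for day, mean in daily.items():
    print(day, mean)
```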

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

Spatial databases include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.

Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

For example, in a 2-D satellite image, each pixel registers the rainfall in a given area.

Maps can be represented in vector format, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs, such as points, lines, polygons, and the partitions and networks formed by these components.
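The difference between raster and vector representations can be sketched as follows. The rainfall grid, the road, and the lake are hypothetical values chosen for illustration, and the helper names are assumptions.

```python
import math

# Raster: a 2-D grid in which each pixel holds rainfall (mm) for one area.
rainfall_raster = [
    [0.0, 1.5, 3.0],
    [0.5, 2.0, 4.5],
]

def rainfall_at(raster, row, col):
    """Look up the value registered by a single pixel."""
    return raster[row][col]

# Vector: features built from basic geometric constructs.
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]             # a line (polyline)
lake = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]  # a polygon

def polyline_length(points):
    """Sum the lengths of the segments joining consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

print(rainfall_at(rainfall_raster, 1, 2))
print(polyline_length(road))
print(polygon_area(lake))
```

Raster data answers "what is the value at this cell?", while vector data supports geometric questions such as length and area.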

A spatial database that stores spatial objects that change with time is called a spatiotemporal database, e.g., one recording a cricket ball in motion.

Text databases are databases that contain word descriptions for objects.

Multimedia databases store image, audio, and video data.

A heterogeneous database consists of a set of interconnected, autonomous component databases.

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.

Stream data are generated and analyzed dynamically as they flow in and out of an observation platform (or window).
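One way to picture an observation window over a data stream is a fixed-size buffer that drops the oldest value as each new one arrives. The sketch below assumes a window of three values and a running average as the analysis; both choices are illustrative.

```python
from collections import deque

class SlidingWindow:
    """Keeps only the most recent `size` values of a stream."""

    def __init__(self, size):
        # A deque with maxlen discards the oldest item automatically.
        self.values = deque(maxlen=size)

    def push(self, value):
        """Add a new value and return the average of the current window."""
        self.values.append(value)
        return sum(self.values) / len(self.values)

window = SlidingWindow(size=3)
averages = [window.push(v) for v in [10, 20, 30, 40]]
print(averages)
print(list(window.values))
```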

Capturing user access patterns in such distributed information environments is called Web usage mining (or Weblog mining).
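A very small sketch of Web usage mining is to count how often users move from one page to another in a Web log. The log entries and the `transition_counts` helper below are hypothetical; real Web logs carry much more information.

```python
from collections import Counter

# Hypothetical Web log in time order: (user, requested page).
log = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/products"),
    ("u2", "/products"), ("u1", "/cart"), ("u2", "/home"),
]

def transition_counts(entries):
    """Count page-to-page moves made by each user."""
    last_page = {}
    counts = Counter()
    for user, page in entries:
        if user in last_page:
            counts[(last_page[user], page)] += 1
        last_page[user] = page
    return counts

counts = transition_counts(log)
print(counts.most_common(1))
```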

› Time Variant

The Warehouse data represent the flow of data through time. It can even contain projected data.

› Non-Volatile

Once data enter the Data Warehouse, they are never removed.

The Data Warehouse is always growing.
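The time-variant and non-volatile properties can be sketched as an append-only store in which every row carries a date and no row is ever updated or deleted. The class name, method names, and sample values are assumptions made for illustration.

```python
from datetime import date

class AppendOnlyWarehouse:
    """Toy store: rows are only ever added, each stamped with a date."""

    def __init__(self):
        self._rows = []

    def load(self, key, value, as_of):
        """Non-volatile: new facts are appended, old ones are kept."""
        self._rows.append((key, value, as_of))

    def value_as_of(self, key, when):
        """Time-variant: return the latest value known on a given date."""
        candidates = [
            (as_of, value)
            for k, value, as_of in self._rows
            if k == key and as_of <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]

    def __len__(self):
        return len(self._rows)

wh = AppendOnlyWarehouse()
wh.load("customer-1", "Chennai", date(2013, 1, 1))
wh.load("customer-1", "Mumbai", date(2014, 6, 1))

print(wh.value_as_of("customer-1", date(2013, 12, 31)))
print(wh.value_as_of("customer-1", date(2014, 11, 4)))
print(len(wh))
```

Because the older row is retained, the store can still answer questions about the past, which is what lets a warehouse keep history even if source systems do not.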

Teradata
Oracle
SAP BW - Business Information Warehouse (SAP NetWeaver BI)
Microsoft SQL Server
IBM DB2 (InfoSphere Warehouse)
SAS

1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS).

DIS was a hardware/software package and GUI for business users to create a database management and analytic system.

Survey (S): (2 Minutes)

The students are asked to browse the following titles and subtitles from the book.

Text Book: Han and Kamber, “Data Mining”, Second Edition, Elsevier, 2008. Page no: 105-109, Page no: 2-21

1. Data mining is otherwise called
a) Knowledge mining
b) Knowledge mining from large data
c) Data extraction
d) None of the above

2. In the knowledge discovery process, data mining comes after which step?
a) Data transformation
b) Data selection
c) Neither (a) nor (b)
d) Both

3. Which characteristic of a data warehouse means that once data enter the warehouse, they are never removed?
a) Integrated
b) Time-variant
c) Subject-oriented
d) Non-volatile

4. An object-relational database consists of entities with
a) Variables
b) Messages
c) Methods
d) All of the above

5. Web usage mining is otherwise called
a) Web mining
b) Web log mining
c) None of the above
d) Both

Specify the seven steps in the KDD process.
Explain the four categories of data warehousing.
Define heterogeneous and legacy databases.
What are the data mining task primitives?
What are the different kinds of data to be mined?

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

Congregate data from multiple sources into a single database so a single query engine can be used to present data.

Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long running, analysis queries in transaction processing databases.

Maintain data history, even if the source transaction systems do not.

Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.

Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.

Present the organization's information consistently.

Provide a single common data model for all data of interest regardless of the data's source.

Restructure the data so that it makes sense to the business users.

Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.

Add value to operational business applications, notably customer relationship management (CRM) systems.
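Two of the opportunities listed above, integrating data from multiple sources and improving data quality with consistent codes, can be sketched as follows. The source names, rows, and code mapping are hypothetical and exist only for this example.

```python
# Hypothetical mapping from source-specific spellings to one standard code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Hypothetical rows from two source transaction systems.
crm_rows = [{"id": 1, "gender": "male"}, {"id": 2, "gender": "F"}]
billing_rows = [{"id": 3, "gender": "m"}, {"id": 4, "gender": "unknown?"}]

def load_into_warehouse(sources):
    """Copy rows from every source into one list with consistent codes.

    Values that cannot be mapped are kept but flagged as bad data.
    """
    warehouse = []
    for source_name, rows in sources.items():
        for row in rows:
            code = GENDER_CODES.get(str(row["gender"]).strip().lower())
            warehouse.append({
                "source": source_name,
                "id": row["id"],
                "gender": code,
                "flagged": code is None,
            })
    return warehouse

warehouse = load_into_warehouse({"crm": crm_rows, "billing": billing_rows})
for row in warehouse:
    print(row)
```

A single query over `warehouse` now sees every source through the same codes, and the flagged rows show where the source data needs attention.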