Date post: | 06-Jan-2018 |
Category: |
Documents |
Upload: | betty-harris |
View: | 216 times |
Download: | 0 times |
CIS 465 - Data Warehousing 1
Data WarehousingVersion 6.0 - 04/18/2000
CIS 465 - Data Warehousing 2
Can your database answer questions like these?
• What is the cost of staff to break into a new line of business?
• What are the travel routes of my competition’s inventory?
• At what velocity is my competitor moving toward a common goal?
• How will a transaction on a certain date be affected by currency exchange rates?
• Is a foreign labor source likely to produce a higher quality product?
• Which 20% of the problem creates 80% of the problems?
CIS 465 - Data Warehousing 3
Can your database answer questions like these?
• By product and location, how can we regain a lost customer base?
• Which skill and staff levels are most likely to accept the voluntary layoff package?
CIS 465 - Data Warehousing 4
Data is Difficult to Manage
• Amount of data increases exponentially; past data must be kept for long times; new data are added rapidly.
• Data are scattered throughout organizations; collected by many individuals; using different methods and devices.
• Only small portions of an organization’s data are relevant for specific decisions.
• An ever-increasing amount of external data needs to be considered in making organizational decisions.
CIS 465 - Data Warehousing 5
Data is Difficult to Manage - contd.
• Raw data may be stored in different computing systems, formats, and human and computer languages.
• Legal requirements relating to data differ among countries and change frequently.
• Selecting data management tools can be difficult because of the large number of tools available.
• Data security, quality, and integrity are critical, yet easily jeopardized.
CIS 465 - Data Warehousing 6
Data and Knowledge Managementment
• Businesses do not run on data. They run on information and their knowledge of how to put that information to use successfully.
CIS 465 - Data Warehousing 7
Some Information Concepts
• Data: Unorganized facts and figures. (raw material)
• Information: Data that has been processed into a form that is meaningful to the recipient and is of real of perceived value in current or prospective actions or decisions.
• Information:– adds to a representation– corrects or confirms previous information– has “surprise” value in that it tells us something we
did not know, or could not predict.– What is a “finished product” to one, may be “raw
materials” to someone else.
CIS 465 - Data Warehousing 8
Definitions: Information vs. Knowledge
• Knowledge: a combination of instincts, ideas, rules, and procedures that guide actions and decisions.
• Helping to provide the best available knowledge to decision-making is another role of information systems
CIS 465 - Data Warehousing 9
Relationship Between Data, Information, and
Knowledge• The difference between data and information is
easy to remember.• It is often cited as the reason why systems that
collect large amounts of information fail to meet management’s information needs.
• There are many methods of converting data into information for decision making.
• Managers take action based on information about a current situation plus their accumulated knowledge. Actions taken feed the process of accumulating more knowledge (experience).
• Example: How do medical students become competent physicians?
CIS 465 - Data Warehousing 10
Relationship Between Data, Information, and
Knowledge
CIS 465 - Data Warehousing 11
Attributes of Quality Information
• Timeliness• Completeness• Conciseness• Relevance• Accuracy• Precision• Appropriateness of Form
CIS 465 - Data Warehousing 12
Special Characteristics of Information
• Usefulness - depends on combination of quality,accessibility,and presentation.
• One person’s information may be another person’s noise.
• Soft data may be as important as hard data.• Ownership of information may be hard to
maintain.• More information is not always better
(information overload).• Politics can often hide or distort
information.
CIS 465 - Data Warehousing 13
Sources of Data
• Internal Data– Data about people, products, services, processes.– Often stored in corporate data bases (e.g. Sales or
HR).– Some data may be disparate in different regions,
but accessible by networks.• Personal Data
– Individuals document expertise by creating personal data - subjective estimates. Some is kept in heads, or mental models.
– Can be store on PCs, or available on the Web.• External Data
– Many sources.
CIS 465 - Data Warehousing 14
Some Sources of Business External Data
• Federal Publications– Survey of Current
Business– Monthly Labor Review– Federal Reserve Bulletin– Employment and Earnings– Commerce Business Daily– Census Bureau
• Other– International Monetary
Fund– Moody’s– Standard & Poor’s– Advertising Age– Dialog and Lexis/Nexis
• Other-contd.– ABI/Inform– Annual Editor &
Publisher Market Guide
– Thomas Register On-line
• Indexes – Encyclopedia of
Business Information Sources
CIS 465 - Data Warehousing 15
What is a data warehouse?
• A data warehouse is a pool of data organized in a format that enables users to interpret data and convert it into useful information to gain knowledge from this interpretation.
• It is a single place that contains complete and consistent data from multiple sources.
• Data warehousing is the act of a business person extracting business value from the data stored in the data warehouse.
CIS 465 - Data Warehousing 16
Collecting Raw Data
• Is not Easy– collect in the field– elicit from people– collect manually, electronically, or by sensors.
• Data collection technology has not kept pace with advances of data storage technology.
• Data collection from external sources is not easy either.
• Bottom Line: Garbage IN, Garbage OUT - GIGO.
• Data Quality is an Important Issue.
CIS 465 - Data Warehousing 17
Data Quality Issues
• Intrinsic DQ:– accuracy, objectivity, believability, reputation.
• Accessibility DQ:– Accessibility and access security
• Contextual DQ– relevancy, value-added,
timeliness,completeness, amount of data.• Representation DQ:
– Interpretability, ease of understanding,concise representation, consistent representation
CIS 465 - Data Warehousing 18
Why Data Warehousing?
• Managers do not make decisions that are “good” or “bad”, they make decisions on the basis of good or bad information.
• Management information =– a. the right information– b. in the right form– c. at the right time.”
• Most transaction-based information systems have difficulty delivering this information.
CIS 465 - Data Warehousing 19
Why Data Warehousing - 2
• Not the right information:– data not easily accessible– meaning is subtly (or significantly
different) from the question context.
– Information is presented with too much or too little detail, covers the wrong time spans, or is in the wrong intervals.
CIS 465 - Data Warehousing 20
Why Data Warehousing - 3
• Not the right time:– Getting this information may
require the efforts of highly skilled professionals who are not generally available at the whim of business managers.
– Data comes from a variety of different systems which are resident on a variety of different technology platforms.
CIS 465 - Data Warehousing 21
Why Data Warehousing - 4
• Not the right format:– If data is extracted, merged, and
converted into a meaningful information, often it is not in a usable format.
– Users will want it loadable into a particular PC tool or spreadsheet with which he/she is familiar.
– Printouts weighing 10 pounds are not in the right format.
– a diskette with a COBOL file description is not in the right format.
CIS 465 - Data Warehousing 22
The Dilemma for Corporate IT
• How to control scarce IT resources consumed by insatiable user demand for ad-hoc reports.
• Each ad-hoc report generated by IT and analyzed by the user generates three more reports to further illuminate the insights gleamed in the first.
• Often the extract programs have few reusable components.
• The user is on a “voyage of discovery in a sea of data”.
CIS 465 - Data Warehousing 23
The Response of Corporate IT
• New methodologies: Align the IT systems with the business goals and requirements.
• These techniques concentrate on business process requirements, not decision support requirements.
• Transaction systems must be rigorously specified in advance. The are an intersection between the organization and the customer.
• These systems should not be a “voyage of discovery” for either.
CIS 465 - Data Warehousing 24
Transaction Systems vs. Analytical Support
Systems• Transaction Systems:
– Insert an order for 300 baseballs– Update this passenger’s airline reservation.– close-out accounts payable records for this vendor.– What is the current checking account balance for
this customer?• Analytical Support Systems:
– Did the sales promotion last quarter do better than the same promotion last year?
– Is the five-day moving average for this security leading or trailing actual prices?
– Which product line sells best in middle-America and how does this correlate to demographic data.
CIS 465 - Data Warehousing 25
Analytical Processing
• Analytical Processing today includes what in the past have been called:
– DSS (Decision Support Systems)– EIS (Executive Information Systems)– ESS (Executive Support Systems)
• It is an evolution of “End-User Computing”
• Placing strategic data access in the hands of decision makers aids productivity and enables them to be better decision makers.
CIS 465 - Data Warehousing 26
Key Difference: OLTP vs. OLAP
• OLTP: Processing specific functions• OLAP: providing flexibility for
undetermined analysis.
A Multidimensional database
CIS 465 - Data Warehousing 28
Data for Decision Support
• The data must be integrated - requires data from many separate internal corporate databases.
• The data must be enriched - through integration with other external data.
• The data must be available - and not constrained by machine resources.
CIS 465 - Data Warehousing 29
Sources of Data
• Internal Data:– Financial Systems– Logistics Systems– Sales Systems– Production Systems– Personnel Systems– Billing Systems– Information Systems
• External Data Needs:– to recognize opportunities– to detect threats– to identify synergies
CIS 465 - Data Warehousing 30
Sources of Data - 2
• External Data Categories– Competitor Data– Economic Data– Industry Data– Credit Data– Commodity Data– Econometric Data– Psychometric Data– Meteorological Data– Demographic Data– Sales & Marketing Data
CIS 465 - Data Warehousing 31
Operational Control vs. Operational Strategy
• Data is a source not just of operational control, but of operational strategy.
• Operational strategy is an attempt to describe the need, in a competitive and turbulent market, to continually innovate and re-align strategy with time scales too short to be comprehended by strategic planning in the conventional corporate sense.
CIS 465 - Data Warehousing 32
Comparison of Control and Strategy Data:
• Operational Data:– short-lived, rapidly changing– requires record-level access– repetitive standard transactions and access patterns– updated in real-time– event-driven; process generates data
• Strategic Data:– long-living, static– data aggregated into sets (which is why warehouse
data is friendly to RDBMS).– ad-hoc queries with some periodic reporting– updated periodically with mass loads– data-driven; data governs process.
CIS 465 - Data Warehousing 33
Information Requirements by Management Level
(Source: Gorry and Scott Morton)Characteristicsof Information
OperationalControl
ManagementControl
StrategicPlanning
Source Largely Internal External
Scope Well defined,narrow
Very wide
Level ofAggregation
Detailed Aggregate
Time Horizon Historical Future
Currency Highly current Quite Old
RequiredAccuracy
High Low
Frequency of Use Very frequent Infrequent
CIS 465 - Data Warehousing 34
Dimensional Modeling
• Dimensional Modeling gives us a way to visualize data.
• The CEO’s perspective: – “We sell products in various markets,
and we measure our performance over time.”
• From the data warehouse designer’s perspective, we hear three dimensions:– We sell Products– in various Markets– and measure performance over time.
CIS 465 - Data Warehousing 35
Dimensional Modeling - contd.
• Management may be interested in examining sales figures in a certain city by product, by time period, by salesperson, and by store.
• The more dimensions involved, the more difficult it is to represent in a single table or graph.
• The ability to add and modify the dimensions used in a table or graph is often known as “slicing and dicing” the data.
CIS 465 - Data Warehousing 36
Dimensional Model of the Business
Product
Market
Ti
me
CIS 465 - Data Warehousing 37
SalesRep
SalesDistrict
SalesRegion
SalesDivision
ProductGroup
ProductLine
Product
Contact
Ship ToShipperShipType
DistrictCredit
Contract CustomerContractType
SalesOrder
CustomerLocation
ContactLocation
OrderItem
Data Dependencies Model of a Business
CIS 465 - Data Warehousing 38
Transaction Processing
• The Relation Model was full of promises for equal access to data.
• In the early 1980’s the relational model was a dream. Typical transaction rates were one per second.
• Today the SABRE system typically processes 4,000 transactions per second, with peak bursts of 13,000 per second.
• OLTP - (On-line Transaction Processing) The point is to get data “in” to the database.
CIS 465 - Data Warehousing 39
Segregating Operational and Warehouse Data
• In the past, data administrators were constantly told to build data sharing, normalization, and non-redundant corporate databases.
• Early attempts at data warehousing tried to share the data with transaction-based systems. This resulted in LONG response times for complex queries.
• The idea today is to keep the two separate.• Separate databases, and perhaps separate
DBMS products and processor platforms are used.
• Controlled and practical redundancy is better than out-of-control theoretical purity.
CIS 465 - Data Warehousing 40
Fundamental Obstacles With Traditional Systems
• Systems Integration - Disintegration grew slowly from islands of automation.– ownership, planning, economic,
organizational development issues all contribute.
• Hardware Architecture• Inconsistent Data• Data Pollution:
– Bad Application Design (semantic and syntactical differences).
– Ownership– Data Entry Conventions
CIS 465 - Data Warehousing 41
The Data Warehouse• Active, tactical, and current events flow from
the operational systems to the data warehouse to become static, strategic, and historical data.
• The data warehouse becomes a “middle ground” where a large number of disparate and incompatible “legacy systems” are tied to an equally diverse collection of end-user workstations.
• Legacy systems usually comprise a hodge-podge of assorted hardware, software, and operational systems accumulated over many decades, are by nature, incompatible with one another and unique to each organization.
CIS 465 - Data Warehousing 42
Practical Facts About the Warehouse
• The chances are remote that any single vendor will be able to develop a product that can interface with all “legacy systems” painlessly and “seamlessly” and at the same time, combine data from modern platforms and data external to the organization.
• Instead warehouse product vendors develop specialized capabilities to work with various environments.
CIS 465 - Data Warehousing 43
Typical Dimensional Model
Time_keyday-of-weekmonthquarteryearholiday_flag
Time_keyproduct_keystore_keydollars_soldunits_solddollars_cost
Product_keydescriptionbrandcategory
Store_keystore_nameaddressfloor_plan_type
Sales FactTime Dimension
Product Dimension
Store Dimension
CIS 465 - Data Warehousing 44
Fact Table
• Fact Table is where numerical measurements of the business are stored.
• Each measurement is taken at the intersection of all the dimensions.
• The “best” facts are numeric, continuously valued and additive.
• For every query made against the fact table may use hundreds of thousands of individual records to construct an answer set.
CIS 465 - Data Warehousing 45
Dimension Tables
• Dimension tables are where textual descriptions of the business are stored.
• Each textual description helps to describe a member of the dimension.
• Example: each member in the product dimension is a specific product. The product dimension database has many attributes to describe the product. A key role of the dimension table attribute is to serve as the source of constraints in a query.
CIS 465 - Data Warehousing 46
Example
BrandAxonFramisWidgetZapper
Dollar Sales780104421395
Unit Sales26350944439
CIS 465 - Data Warehousing 47
Example Query
• Find all product brands that were sold in the first quarter of 1995 and present the total dollar sales as well as the number of units.
• Brand is a collection of individual products.
• To construct:– A. Drag attribute brand from product dimension.
Place as Row Header.– B. Drag Dollar Sales and Units Sold from the Fact
Table, and place to the right of the Brand row header.– C. Specify row constraint “1st Q 1995” on the
quarter attribute in the Time Dimension Table.
CIS 465 - Data Warehousing 48
Multidimensionality
• Examples of dimensions– products, salespeople, market segments,
business units, geographical locations, distribution channels, country, industry
• Examples of Facts or Measures:– money, sales volume, head count, inventory,
profit, actual vs. forecast.• Examples of Time:
– daily, weekly, monthly, quarterly, yearly
CIS 465 - Data Warehousing 49
Components of a Data Warehouse - 1
• Acquisition - The first component handles acquisition of data from legacy systems and outside sources.
• Data is identified, copied, formatted, and prepared for loading into a warehouse.
• Vendors provide tools for extraction and preparation.
CIS 465 - Data Warehousing 50
Components of a Data Warehouse - 2
• Storage Area - The second component is the storage area managed by relational databases, multi-dimensional databases, specialized hardware - symmetric multiprocessor (SMP) or massively parallel processors (MPP) machines - or by software.
• The storage component hold the data so that many different data mining, executive information and decision support systems can make use of it effectively.
CIS 465 - Data Warehousing 51
Components of a Data Warehouse - 3
• Access - The third component of the warehouse is the access area.
• Different end-user PCs and workstations draw data from the warehouse with the help of multi-dimensional analysis tools, neural networks, data discover tools, or analysis tools.
• These “smart” data-mining tools are the driving force behind the data warehouse concept.
• What good is it to store all the information without some way to understand it in new and different ways.
CIS 465 - Data Warehousing 52
Data Warehouse Access Tools
• Intelligent Agents and Agencies - tools work and think for user.
• Query Facilities and Managed Query environments.
• Statistical Analysis - One of the biggest surprises in the data warehousing marketplace is the resurgence of interest in traditional statistical analysis, and the concomitant resurrection of the popularity to products like SAS and SPSS.
CIS 465 - Data Warehousing 53
Data Warehouse Access Tools - 2
• Data Discovery - A large class of tools formerly classified as decision support, artificial intelligence and expert systems. They now make use of neural networks, fuzzy logic, decision trees, and other tools from advanced mathematics to allow a user to “sift” through massive amounts of raw data to “discover” new, interesting, insightful, and in many cases useful things about the organization, its operations, and its markets.
• Currently there are nearly 60 different data discovery tools/products on the market.
CIS 465 - Data Warehousing 54
Data Warehouse Access Tools - 3
• OLAP - On-line Analytical Processing often uses multi-dimensional spreadsheet tools allowing users to look at information from many different angles.
• Users are able to “slice and dice” reports and to look at the same kinds of information at different levels at the same time.
• Typical OLAP application might allow a product manager to view sales figures for a given product at the national level, see them broken down by division, drill down to see territories within a division, check sales numbers for each store within a territory, and then compare them against sales of stores from another territory.
CIS 465 - Data Warehousing 55
Data Warehouse Access Tools - 4
• Data Visualization - These tools turn ugly, boring numbers into exciting visual presentations.
• These tools bring graphical representation to new heights. Example: Geographical information systems turn data about stores, individuals, or anything else into compelling, easy to understand, dynamic maps.
• PC-based Geographic Information systems have the ability to display spatial occurrences and the relationship between and among geographically specific variables.
CIS 465 - Data Warehousing 56
Developing the Data Warehouse
• The most expensive warehousing ventures involve major new hardware acquisitions and significant investments in training, analysis, and systems development costs.
• Typical startup projects allocate 60% of budget for hardware and software for creation of a powerful storage component. 30% on data mining and acquisition tools.
• Budgeting for Systems Analysis and Development has 50% of budget on acquisition capabilities, 30% fund user solutions, 20% creation of databases in the storage component.
CIS 465 - Data Warehousing 57
Developing the Data Warehouse
• Clarify what you want to do with the Warehouse - How Will It be Used.
• Scrutinize the offerings of vendors and systems integrators. Make sure you understand which functions they provide, and which you must build.
• Most successful projects start as small, tightly defined tactical systems to solve pressing business needs, and develop into larger systems over time.
CIS 465 - Data Warehousing 58
DW Summary: Key Concepts
• The DW is a “collection of integrated, subject-oriented databases designed to support the decision support function where each unit of data is non-volatile and relevant to some moment in time: (W.H. Inmon, 1992).
• Implicit Assumptions:– physically separate from operational systems– hold aggregated data and transactional
(atomic) data for management separate from those used for OLTP.
CIS 465 - Data Warehousing 59
DW Summary: Characteristics
• Subject-orientation• integrated• non-volatile (i.e. not updated)• time variant (kept for long periods, for
forecasting and trend analysis)• summarized• large volume• not normalized• metadata• data sources
CIS 465 - Data Warehousing 60
Data Mining