WHITE PAPER

Process Neutral Data Modelling

DAVID M WALKER

Version: 1.0 Date: 10/02/2009

Data Management & Warehousing

138 Finchampstead Road, Wokingham, Berkshire, RG41 2NU, United Kingdom

http://www.datamgmt.com


© 2009 Data Management & Warehousing

Table of Contents

Synopsis
Intended Audience
About Data Management & Warehousing
Introduction
The Problem
  The Example Company
  The Real World
The Customer Paradigm
Requirements of a Data Warehouse Data Model
  Assumptions
  Requirements
The Data Model
  Major Entities
  Type Tables
  Band Tables
  Property Tables
  Event Tables
  Link Tables
  Segment Tables
The Sub-Model
  History Tables
  Occurrences and Transactions
Implementation Issues
  The ‘Party’ Special Case
  Partitioning
  Data Cleansing
  Null Values
  Indexing Strategy
  Enforcing Referential Integrity
  Data Insert versus Data Update
  Row versus Set Based Loading in ETL
  Disk Space Utilisation
  Implementation Effort
Data Commutativity
Data Model Explosion and Compression
  How big does the data model get?
  Can the data model be compressed?
Which Results to Store?
The Holistic Approach
Summary
Appendix 1 – Data Modelling Standards
  General Conventions
  Table Conventions
  Column Conventions
  Index Conventions
  Standard Table Constructs
  Sequence Numbers For Primary Keys
Appendix 2 – Understanding Hierarchies
  Sales Regions
  Internal Organisation Structure
Appendix 3 – Industry Standard Data Models
Appendix 4 – Information Sparsity
Appendix 5 – Set Processing Techniques
Appendix 6 – Standing on the shoulders of giants


Further Reading
  Overview Architecture for Enterprise Data Warehouses
  Data Warehouse Governance
  Data Warehouse Project Management
  Data Warehouse Documentation Roadmap
  How Data Works
List of Figures
Copyright


Synopsis

This paper describes in detail the process for creating an enterprise data warehouse physical data model that is less susceptible to change. Change is one of the largest on-going costs in a data warehouse and therefore reducing change reduces the total cost of ownership of the system. This is achieved by removing business process specific data and concentrating on core business information. The white paper examines why data-modelling style is important and how issues arise when using a data model for reporting. It discusses a number of techniques and proposes a specific solution. The techniques should be considered when building a data warehouse solution even when an organisation decides against using the specific solution. This paper is intended for a technical audience and project managers involved with the technical aspects of a data warehouse project.

Intended Audience

Reader                  Recommended Reading
Executive               Synopsis
Business Users          Synopsis
IT Management           Synopsis
IT Strategy             Entire Document
IT Project Management   Entire Document
IT Developers           Entire Document

About Data Management & Warehousing

Data Management & Warehousing is a specialist data warehousing consultancy based in Wokingham, Berkshire, in the United Kingdom. Founded in 1995 by David M Walker, our consultants have worked for major corporations around the world, including in the US, Europe, Africa and the Middle East. Our clients are invariably large organisations with a pressing need for business intelligence. We have worked in many industry sectors but have specialists in telcos, manufacturing, retail, finance and transport, as well as technical expertise in many of the leading technologies. For further information visit our website at: http://www.datamgmt.com

Crossword Clue: Expert Gives Us Real Understanding (4 letters)


Introduction

Commissioning a data warehouse system is a major undertaking. Organisations will invest significant capital in the development of the system. The data model is always a major consideration and many projects will spend a significant part of the budget on developing and re-working the initial data model.

Unfortunately projects also often fail to look at the maintenance costs of the data model that they develop. A data model that is fit for purpose when developed will rapidly become an expensive overhead if it needs to change whenever the source systems change. The cost involved is not only in the change to the data model but also in the changes to the ETL that feeds the data model. This problem is exacerbated by the fact that changes to the data model may be made in a way that is inconsistent with the original design approach. The data model loses transparency and becomes even more difficult to maintain. For many large data warehouse solutions it is not uncommon, within a short time of going live, to have a resource permanently assigned to maintaining the data model and several more resources assigned to managing the change in the associated ETL.

By understanding the problem, and by using techniques imported from other areas of systems and software development as well as change management techniques, it is possible to define a method that will greatly reduce this overhead. This white paper sets out an example of the issues, develops from it a statement of requirements for the data model, and then demonstrates a number of techniques which, when used together, can address those requirements in a sustainable way.


The Problem

Data modelling is the process of defining the database structures in which to hold information. To understand the Process Neutral Data Modelling approach, this paper first looks at why these database structures have such an impact on the data warehouse. In order to demonstrate the issues with creating a data model for a data warehouse, more experienced readers are asked to bear with the necessarily simplistic examples that follow.

The Example Company

A company supplies and installs widgets. There are a number of different widget types, each having a name and specific colour. Each individual widget has a unique serial number and can have a number of red lamps and a number of green lamps plugged into it. The widgets are installed into cabinets at customer sites and from time to time engineers come in and change the relative numbers of red and green lamps. Cabinets are identified by the customer name and a customer cabinet number. For the operational systems the data model might look something like this1:

Figure 1 - Initial Operational System Data Model2

This simple data model describes both the widget and the cabinet and provides the current combinations. It does not provide any historical context: “What was the previous configuration and when was it changed?” Historical data can be recorded by simply adding a start date and an end date to each of the main tables. This provides the ability to report on the historical configuration3. In order to facilitate this a separate reporting environment would be set up, because retaining history in the operational system would unacceptably reduce the operational system performance. There are three consequences of doing this:

• Queries are now more complex. In order to report the information for a given date the query has to allow for the required date falling between the start date and the end date of the record in each of the tables (see the query sketch below). The extra complexity slows the execution of the query.

• The volume of data stored has also increased. The storage of dates has a minor impact on the size of each row, but this is small when compared to the number of additional rows that need to be stored.4

• Data has to be moved from the operational system to the reporting system via an extract, transform and load (ETL) process. This process has to extract the data from the operational system, compare the records to the current records in the reporting system to determine if there are any changes and, if so, make the required adjustments to the existing record (e.g. updating the end date) and insert the new record. Already the process is more complex and time consuming than simply copying the data across.5

1 Data models in this document are illustrative and therefore should be viewed as suitable for making specific points rather than complete production quality solutions. Some errors exist to explicitly demonstrate certain issues.
2 There are several conventions for data modelling. In this and subsequent diagrams the link with a 1 and ∞ represents a one-to-many relationship where the ‘1’ record is a primary key field and the ‘∞’ represents the foreign key field.
3 Note that the ‘WIDGET_LOCATIONS’ table requires an additional field called ‘INSTALL_SEQUENCE’ to allow for the case where a widget is re-installed in a cabinet.
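To make the first consequence concrete, the following is a minimal sketch of the kind of point-in-time predicate involved. The table and column names (WIDGETS, WIDGET_LOCATIONS, CABINETS, WIDGET_ID, etc.) are assumptions based on the diagrams rather than definitions taken from this paper:

-- Hypothetical point-in-time query against the reporting model:
-- which widgets were in which cabinets on 31 January 2009?
SELECT w.SERIAL_NUMBER,
       c.CABINET_NUMBER,
       wl.START_DATE,
       wl.END_DATE
FROM   WIDGETS w,
       WIDGET_LOCATIONS wl,
       CABINETS c
WHERE  wl.WIDGET_ID  = w.WIDGET_ID
AND    wl.CABINET_ID = c.CABINET_ID
-- every time-variant table needs this extra pair of predicates
AND    wl.START_DATE <= DATE '2009-01-31'
AND   (wl.END_DATE IS NULL OR wl.END_DATE >= DATE '2009-01-31');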

Figure 2 - Initial Reporting System Data Model

When the reporting system is built, it accurately reflects the current business processes and operational systems, and provides historical data. From a systems management perspective there is now an additional database, and a series of ETL or interface scripts that have to be run reliably every day.

The systems architecture may be further enhanced so that the reporting system becomes a data warehouse and the users make their queries against data marts, sets of tables where the data has been re-structured in order to simplify the users’ query environment. The data marts typically use star-schema or snowflake-schema data modelling techniques or tool-specific storage strategies6. This adds an additional layer of ETL to move data between the data warehouse and the data mart.

The company doesn’t stop here, however. The product development team create a new type of widget. This new widget allows amber lamps and can optionally be mounted in a rack that is in turn mounted in a cabinet. The IT director also insists that the new OLTP application be more flexible for other future developments.

4 Assume that everything remains the same except that widgets are moved around (i.e. there are no new widgets and no new cabinet/customer combinations); the WIDGET_LOCATIONS table then grows in direct proportion to the number of changes. If each widget were modified in some way once a month then the reporting system table would be twelve times bigger than the operational system table after one year, and this is before any other change is handled.
5 Additional functionality such as data cleansing will also impact the complexity of the ETL and affect performance.
6 This is accepted good practice; the design and implementation of data marts is outside the scope of this paper.


These business process changes result in a new data model for the operational system.

Figure 3 - Second Version Operational System Data Model

The reporting system is also now a live system with a large amount of historical information, and it too must be re-designed. The operational system will be implemented to meet the business requirements and timescales regardless of whether the reporting system is ready. It also may not be possible to create the history required for the new data model when it is changed.7

If a data mart is built from the data warehouse there are two impacts: firstly, the data mart model will need to be changed to exploit the new data; secondly, the change to the data warehouse model will require the data mart ETL to be modified regardless of any changes to the data mart data model.

The example company does not stop here, however, as senior management decide to acquire a smaller competitor. The new subsidiary has its own systems that reflect its own business processes. The data warehouse was built with a promise of providing integrated management reporting, so there is an expectation that the data from the new source system will be quickly and seamlessly integrated into the data warehouse. From a technical perspective this presents issues around mapping the new source system data model to the existing data warehouse data model, critical information data types8, duplication of keys9, etc. that all cause problems with the integration of data and therefore slow down the processing.

Within a few short iterations of change it is possible to see the dramatic impact on the data warehouse and that the system is likely to run into issues.

7 A common example of this is an organisation that captures the fact that an individual is married or not. Later the organisation decides to capture the name of the partner if someone is married. It is not possible to create the historical information systematically, so for a period of time the system has to support the continued use of the marital status alone and then possibly run other activities, such as outbound calling, to complete the missing historical data.
8 The example database assumed that the serial number was numeric and used it as a primary key, but what happens if the acquired company uses alphanumeric serial numbers?
9 If both companies use numbers starting from 1 for their customer IDs then there will be two customers who have the same ‘unique’ ID, and customers that have two ‘unique’ IDs.


The Real World

The example above is designed to illustrate some of the issues that affect data warehouse data modelling. In reality business and technical analysts will handle some of these issues in the design phase, but how big is the data-modelling problem in the real world?

o A UK transport industry organisation has three mainframes, each of which is only allowed to perform one release a quarter. Each system also feeds the data warehouse; as a consequence the mainframe feeds require validation and change every month. Whilst the main data comes from these three systems, there are sixty-five other Unix-based operational systems that feed the data warehouse, and several hundred desktop-based applications also provide data. Most of these source systems do not have good change control or governance procedures to assist in impact analysis. Change for this organisation is business as usual.

o A global ERP vendor supplies a system with over five thousand database objects and typically makes a major release every two years, a ‘dot’ release every six months and has numerous patches and fixes in between each major release. This type of ERP system is in use in nearly every major company and the data is a critical source to most data warehouses.

o A global food and drink manufacturer that came into existence through numerous mergers and acquisitions, and that also divested some assets, found itself with one hundred and thirty-seven general ledger instances in ten countries running seventeen different ERP packages. Even where the ERP packages were the same they were not necessarily the same version of the package. The business intelligence requirement was for a single data warehouse and a single data model.

o A European Telco purchased a three-hundred-table ‘industry standard’ enterprise data model from a major business intelligence vendor and then spent two years analysing it before starting the implementation. Within six months of implementation they had changed some sixty percent of the tables as a result of analysis omissions.

o A UK-based banking and insurance business outsources all of its product management to business partners and only maintains the unified customer management systems (website, call centres and marketing). As a result nearly all of the ‘source systems’ are external to the organisation, and whilst there are contractual agreements that the format and data remain fixed, in practice there is significant and regular change in the format and information provided to both operational and reporting systems.

Obviously these issues cannot be fixed just by creating the correct data model for the data warehouse10, but the objective of the data model design should be twofold:

o To ensure that all the required data can be stored effectively in the data warehouse.

o To ensure that the design of the data model does not impose cost and, where possible, actively reduces the cost of change on the system.

10 Data Management & Warehousing have published a number of other white papers that are available at http://www.datamgmt.com and look at other aspects of data warehousing and address some of these issues. See Further Reading at the end of this document for more details.


The Customer Paradigm

Data warehouse development often starts with a requirements gathering exercise. This may take the form of interviews or workshops where people try to define what the customer is. If a number of different parts of the business are involved then the definition of customer soon becomes confused and controversial, and this negatively impacts the project. Most organisations have a sales funnel that describes the process of capturing, qualifying, converting and retaining customers.

Marketing say that the customer is anyone and everyone that they communicate with. The sales teams view the customer as those organisations in their qualified lead database or for whom they have account management responsibility post-sales. The customer services team are clear that the customer is only those organisations who have purchased a product and, where appropriate, have purchased a support agreement as well. Other questions are asked in the workshops, such as “What about customers who are also suppliers or partners?” and “How do we deal with customers who have gone away and then come back after a long period of time?”

The most common solutions that result either add ‘flag’ or ‘indicator’ columns to the customer table to represent each category, or create multiple tables for the different categories required and repeat the data in each of the tables. This example clearly demonstrates that the business process is being embedded into the data model: the current business process definition(s) of customer are defining how the data model is created. What has been forgotten is that these ‘customers’ exist outside the organisation and it is their interaction with different parts of the organisation that defines their status as customer, supplier, etc.

In legal documents there is the concept of a ‘party’, where a party is a person or group of persons that compose a single entity that can be identified as one for the purposes of the law11. This definition is one that should be borrowed and used in the data model. If users query a data mart that is loaded with data extracted from the transaction repository, and data marts are built for a specific team or function that only requires one definition of the data, then the current12 definition can be used to build that data mart and different definitions used for other departments.

11 http://en.wikipedia.org/wiki/Party_(law)
12 This also allows flexibility: when business processes change it is possible, at a cost, to change the rules by which data is extracted. This cost of change is relatively much lower than trying to rebuild the data warehouse and data mart with a new definition.

Figure 4 - The Sales Funnel


As a result of this approach two questions are common:

• Isn’t one of the purposes of building a data warehouse to have a single version of the truth? Yes. There is a single version of the truth in the data warehouse and this single version is perpetuated into the data marts; the difference is that the information in the data mart is qualified. Asking the question “How many customers do we have?” should get the answer “Customer Services have X active service contract customers” and not the answer “X” without any further qualification.

• What happens if different teams or departments have different data? People within the organisation work within different processes and use the same terminology with often different definitions. It is unlikely and impractical in the short term to change this, although it is possible that in the long term the data warehouse project will help with the standardisation process. In the meantime it is an education process to ensure that answers are qualified. It is important to recognise that different departments legitimately have different definitions, and therefore to recognise and understand the differences rather than fighting about who is right.

It might be argued that there are too many differences to put all individuals and organisations in a single table; this and other issues will be discussed later in the paper.


Requirements of a Data Warehouse Data Model

Having looked at the problems that can affect a data warehouse data model it is possible to describe the requirements that should be placed on any data model design.

Assumptions

1. The data model is for use in the architectural component called the transaction repository or data warehouse.13

2. As the data model is used in the data warehouse it will not be a place where users go to query the data; instead users will query separate dependent data marts.

3. As the data model is used in the data warehouse data will be extracted from it to populate the data marts by ETL tools.

4. As the data model is used in the data warehouse the data will be loaded into it from the source systems by ETL tools.

5. Direct updates (i.e. not through formally released ETL processes) will be prohibited; instead a separate application or applications will exist as a surrogate source.

6. The data model will not be used in a ‘mixed mode’ where some parts use one data modelling convention and other parts use another. (This is generally bad practice with any modelling technique, but is often the outcome when responsibility for data modelling is distributed or re-assigned over time.)

Requirements

1. The data model will work on any standard business intelligence relational database.14 This is to ensure that it can be deployed on any current platform and if necessary re-deployed on a future platform.

2. The data model will be process neutral i.e. it will not reflect current business processes, practices or dependencies but instead will store the data items and relationships as defined by their use at the point in time when the information is acquired.

3. The data model will use a design pattern15 i.e. a general reusable solution to a commonly occurring problem. A design pattern is not a finished design but a description or template for how to solve a problem that can be used in many different situations.

13 For further information on Transaction Repositories see the Data Management & Warehousing white paper “An Overview Architecture For Enterprise Data Warehouses”.
14 A typical list would (at the time of writing) include IBM DB2, Microsoft SQL Server, Netezza, Oracle, Sybase, Sybase IQ, and Teradata. For the purposes of this document it implies compliance with at least the SQL92 standard.
15 http://en.wikipedia.org/wiki/Software_design_pattern


4. Convention over configuration16: This is a software design paradigm which seeks to decrease the number of decisions that developers need to make, gaining simplicity but not necessarily losing flexibility. It can be applied successfully to data modelling, reducing the number of decisions the data modeller has to make by ensuring that tables and columns use a standard naming convention and are populated and queried in a consistent fashion. This also has a significant impact on the effort required of an ETL developer.

5. The design should also follow the DRY (Don’t Repeat Yourself) principle. This is a process philosophy aimed at reducing duplication. The philosophy emphasizes that information should not be duplicated, because duplication increases the difficulty of change, may decrease clarity, and leads to opportunities for inconsistency.17

6. The data model should be significantly static over a long period of time, i.e. there should not be a need to add or modify tables on a regular basis. Here there is a difference between designed and implemented: it is possible to have designed a table but not to implement it until it is actually required. This does not affect the static nature of the data model, as the placeholder already exists.

7. The data model should store data at the lowest possible level18 and avoid the storage of aggregates.

8. The data model should support the best use of platform specific features whilst not compromising the design.19

9. The data model should be completely time-variant, i.e. it should be possible to reconstruct the information at any available point in time.20

10. The data model should act as a communication tool to aid the refinement of requirements and an explanation of possibilities.

16 For further information see http://en.wikipedia.org/wiki/Convention_over_Configuration and http://softwareengineering.vazexqi.com/files/pattern.html. The Ruby on Rails language (http://www.rubyonrails.org/) makes extensive use of this principle.
17 DRY is a core principle of Andy Hunt and Dave Thomas’s book The Pragmatic Programmer. They apply it quite broadly to include “database schemas, test plans, the build system, even documentation.” When the DRY principle is applied successfully, a modification of any single element of a system does not change other logically unrelated elements. Additionally, elements that are logically related all change predictably and uniformly, and are thus kept in sync. (http://en.wikipedia.org/wiki/DRY). This does not automatically imply database normalisation, but database normalisation is one method for ensuring ‘dryness’.
18 This is the origin of the term ‘Transaction Repository’ rather than ‘Data Warehouse’ in Data Management & Warehousing documentation. The transaction repository stores the lowest level of data that is practical and/or available. (See An Overview Architecture for Enterprise Data Warehouses.)
19 This turns out to be both simple and very effective. For Oracle the most common features that need support include partitioning and materialized views. For Sybase IQ and Netezza there is a preference for inserts over updates due to their internal storage mechanisms. For all databases there is variation in indexing strategies. These and other features should be easily accommodated.
20 Also known as temporal. Most data warehouses are not linearly time-variant but quantum time-variant. If a status field is updated three times in a day and the data warehouse reflects all changes then it is linearly time-variant. If a data warehouse holds the first and last values only, because a batch process loads it once a day, then it is quantum time-variant where the quantum is, in this case, one day. Quantum time-variant solutions can only resolve data to the level of the quantum unit of measure.


The Data Model

Having defined the requirements for the data model, it is now possible to start looking at what is needed to design it. This is done by breaking down the tables that will be created into different groups depending on how they are used. The sections below discuss the main elements of the data model. Some basics, such as naming conventions, standard short names and the keys used in the data model, are not described here; a complete set of data modelling rules and example models can be found in the appendices.

Major Entities

Party is, as described in the customer paradigm section above, an example of a type of table within the Process Neutral Data Modelling method known as a ‘Major Entity’. These are tables that provide the placeholders for all major subject areas of the data model and around which other information is grouped. Each business transaction will relate to a number of major entities. Some major entities are global, i.e. they apply to all types of organisation (e.g. Calendar), and a number are industry specific (e.g. for Telco, Manufacturing, Retail, Banking, etc.). It would be very unusual for an organisation to need a major entity that was not industry-wide. Below is a list of some of the most common:

• Calendar Every data warehouse will need a calendar. It should always contain data to the day level and never to parts of the day. In some cases there is a need to support sub-types of calendar for non-Gregorian calendars21.

• Party Every organisation will have dealings between parties. This will normally include three major sub-types: individuals, organisations (any formal organisation such as a company, charity, trust, partnership, etc.) and organisational units (the components within an organisation, including the system owner’s organisation).

• Geography The information about where. This is normally sub-typed into two components, address and location. Address information is often limited to postal addresses22 whilst location is normally described by the longitude and latitude via GPS co-ordinates. Other specialist geographic models exist that may need to be taken into account.23

• Product_Service (also known as Product or as Service) This is the catalogue of the products and/or services that an organisation supplies.

• Account Every customer will have at least one account if financial transactions are involved (even those organisations that do not think they currently use the concept of account will do so as accounting systems always have the concept of a customer with one or more accounts).

21 See http://www.qppstudio.net/footnotes/non-gregorian.htm for various calendars; notably 2008 is the Muslim year 1429 and the Jewish year 5768.
22 Some countries, such as the UK, have validated lists of all addresses (see the UK Post Office Postcode Address File at http://www.royalmail.com/portal/rm/jump2?mediaId=400085&catId=400084).
23 Network Rail in the UK use an Engineer’s Line Reference, which is based on a linear reference model and refers to a known distance from a fixed point on a track. In Switzerland there is an entire national co-ordinate system (http://en.wikipedia.org/wiki/Swiss_coordinate_system).


• Electronic_Address Any electronic address such as a telephone number, email address, web address, IP address etc. This is normally sub-typed by the categories used.

• Asset (also known as Equipment) A physical object that can be uniquely identified (normally by a serial number or similar). This may be used or incorporated in a PRODUCT_SERVICE, or sold to a customer etc. In the example Cabinet, Rack and Widget were all examples of Asset, whilst Widget Type was an example of PRODUCT_SERVICE.

• Component A physical object that cannot be uniquely identified by a serial number but has a part number and is used in the make-up of either an asset or of a product service. In the example company there was not a particular record of the serial numbers of the lamps, however they would all have had a part number that described the type of lamp to be used.

• Channel A conceptual route to market (e.g. direct, indirect, web-based, call-centre, etc.).

• Campaign A marketing exercise that is designed to promote the organisation, e.g. the running of a series of adverts on the television.

• Campaign Activities The running of a specific advert as part of a larger campaign.

• Contract Depending on the type of business the relationship between the organisation and its supplier or its customer may require the concept of a contract as well as that of an account.

• Tariff (also known as Price_List) A set of charges and discounts that can be applied to product services at a point in time.

This list is not comprehensive, but if an organisation can effectively describe its major entities and combine this information with the interactions between them (the occurrences or transactions) then it has the basis of a very successful data warehouse. Major entities can have any meaningful name provided it is not a reserved word in the database or (as will be seen below) a reserved word within the design pattern of Process Neutral Data Modelling.

Readers who are familiar with the concepts of star schemas and data marts will also be aware that these major entities are very close to the basic dimensions that most data marts use. This should come as no surprise, as these are the major data items of any business regardless of its business processes or specific industry sector, and a data mart is only a simplification of the data presented for the user. This effect is called “natural star schemas” and will be explored in more detail later.


Lifetime Value

The next decision is which columns (attributes) should be included in each table. Much like the process of normalising a database24, the objective is to minimise duplication of data; there is also a requirement to minimise updates. The attributes that are included should therefore have ‘lifetime value’, i.e. they should remain constant once they have been inserted into the database. This means that variable data needs to be handled elsewhere. Using some of the major entities above as examples:

Calendar
  Lifetime Value Attributes: Date, Public Holiday Flag

Geography
  Lifetime Value Attributes: Address Line 1, Address Line 2, City, Postcode25, County, Country
  Non-Lifetime Value Attributes: Population

Party (Individuals)
  Lifetime Value Attributes: Forename, Surname26, Date of Birth, Date of Death, Gender27, State ID Number
  Non-Lifetime Value Attributes: Marital Status, Number of Children, Income

Party (Organisations)
  Lifetime Value Attributes: Name, Start Date, End Date, State ID Number
  Non-Lifetime Value Attributes: Number of Employees, Turnover, Shares Issued

Account
  Lifetime Value Attributes: Account Number, Start Date, End Date
  Non-Lifetime Value Attributes: Balance

Other than this lifetime value requirement for columns, every table must comply with the general rules for any table. For example, every table will have a key column that uses the table short name, made up of six characters, with the suffix _DWK28, a TIMESTAMP column and an ORIGIN column.

24 http://en.wikipedia.org/wiki/Database_normalization: Database normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies.
25 This may occasionally be a special case, as postal services do, from time to time, change postal codes that are normally static.
26 There is a specific special case that deals with the change of name for married women; it is dealt with in the section ‘The Party Special Case’ later.
27 One insurance company had to deal with updatable genders because underwriting rules require assessment based on birth gender and not gender as a result of re-assignment surgery. Therefore for marketing it had to handle ‘current’ gender and for underwriting it had to handle ‘birth’ gender.
28 See the data modelling rules appendix for how this name is created.
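As a sketch of these conventions, a major entity table might be declared as below. The six-character short name PARTIE and the column sizes are assumptions for illustration (the attributes are the lifetime value attributes from the Party example above); Appendix 1 contains the authoritative naming rules:

-- Illustrative major entity table following the stated conventions.
CREATE TABLE PARTIES (
    PARTIE_DWK      INTEGER      NOT NULL,  -- key: six-character short name + _DWK
    FORENAME        VARCHAR(100),           -- lifetime value attributes only
    SURNAME         VARCHAR(100),
    DATE_OF_BIRTH   DATE,
    DATE_OF_DEATH   DATE,
    GENDER          CHAR(1),
    STATE_ID_NUMBER VARCHAR(30),
    "TIMESTAMP"     TIMESTAMP    NOT NULL,  -- mandatory audit column (quoted: reserved word on many platforms)
    ORIGIN          VARCHAR(30)  NOT NULL,  -- mandatory source-system column
    PRIMARY KEY (PARTIE_DWK)
);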


Type Tables

There is often a need to categorise information into discrete sets of values. The valid set of categories will probably change over time and therefore each category record also needs to have lifetime value. Examples of this categorisation have already occurred with some of the major entities:

• Party: Individual, Organisation, Organisation Unit
• Geography: Postal Address, Location
• Electronic Address: Telephone, E-Mail

To support this and to comply with the requirement for convention over configuration all _TYPES tables of this format have a standard data model as follows:

• The table will have the same name as the major entity but with the suffix _TYPES (e.g. PARTY_TYPES, GEOGRAPHY_TYPES, etc.).

• The table will always have a key column that uses the six character short code and the _DWK suffix.

• The table will have a _TYPE column that is the type name.
• The table will have a _DESC column that is a description of the type.
• The table will have a _GROUP column that groups certain types together.
• The table will have a _START_DATE column and a _END_DATE column.
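A minimal DDL sketch of this standard construct, using PARTY_TYPES as the example, follows. The PARTYP short code comes from the example data below; the column sizes are assumptions:

CREATE TABLE PARTY_TYPES (
    PARTYP_DWK            INTEGER      NOT NULL,  -- six-character short code + _DWK
    PARTY_TYPE            VARCHAR(50)  NOT NULL,  -- the type name
    PARTY_TYPE_DESC       VARCHAR(255),           -- description of the type
    PARTY_TYPE_GROUP      VARCHAR(50)  NOT NULL,  -- groups certain types together
    PARTY_TYPE_START_DATE DATE         NOT NULL,  -- mandatory (see note 29)
    PARTY_TYPE_END_DATE   DATE,                   -- null while the type is current
    PRIMARY KEY (PARTYP_DWK)
);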

This is a type table in its entirety. If a table needs more information (i.e. columns) then it is not a _TYPES table and must not have the _TYPES suffix, as it does not comply with the rules for a _TYPES table. Examples of data in _TYPES tables might include:

PARTY_TYPES

PARTYP_DWK | PARTY_TYPE  | PARTY_TYPE_DESC                                                                               | PARTY_TYPE_GROUP  | PARTY_TYPE_START_DATE | PARTY_TYPE_END_DATE
1          | INDIVIDUAL  | An Individual                                                                                 | INDIVIDUAL        | 01-JAN-1900           |
2          | LTD COMPANY | A company in which the liability of the members in respect of the company’s debts is limited | ORGANISATION      | 01-JAN-1900           |
3          | PARTNERSHIP | A business owned by two or more people who are personally liable for all business debts      | ORGANISATION      | 01-JAN-1900           |
4          | DIVISION    | A division of a larger organisation                                                           | ORGANISATION UNIT | 01-JAN-1900           |

Figure 5 - Example data for PARTY_TYPES

The start date has little initial value in this context, although it is a mandatory field29 and therefore has to be completed with a date before the earliest party in this example. Legal types of organisation do change over time, so it is possible that the start and end dates of these will become significant. These types do not describe the role that the party is performing (i.e. Customer, Supplier, etc.); they describe the type of the party (e.g. Individual, etc.). Describing the role comes later. The type and group columns are repeated for INDIVIDUAL, as there is no hierarchy of information for this value but the field is mandatory.

29 Start Dates in _TYPES tables are mandatory as, with only a few exceptions, they are required information. In order to be consistent they therefore have to be mandatory for all _TYPES tables


GEOGRAPHY_TYPES

GEOTYP_DWK | GEOGRAPHY_TYPE | GEOGRAPHY_TYPE_DESC                                                       | GEOGRAPHY_TYPE_GROUP | GEOGRAPHY_TYPE_START_DATE | GEOGRAPHY_TYPE_END_DATE
1          | POSTAL         | An address as supported by the postal service                             | POSTAL               | 01-JAN-1900               |
2          | LOCATION       | A point on the surface of the earth defined by its longitude and latitude | LOCATION             | 01-JAN-1900               |

Figure 6 - Example Data for GEOGRAPHY_TYPES

The start date in this context has little initial value, although it is a mandatory field and therefore has to be completed with a date. These types do not describe the role that the geography is performing (i.e. home address, work address, etc.); they describe the type of the geography (postal address, point location, etc.). The type and group columns are repeated for both values, as there is no hierarchy of information for them.

CALENDAR_TYPES

The convention over configuration design aspect allows for this table; however, it is rarely needed and can therefore be omitted. This is an example of a table that can be described as designed (i.e. it is known exactly what it looks like) but not implemented.

_TYPES tables will appear in other parts of the data model but they will always have the same function and format. The consequence of this design re-use is that implementing an application30 to manage the source of _TYPES data is easy. The system that manages the type data needs a single table with the same columns as a standard _TYPES table plus an additional column called, for example, DOMAIN. This DOMAIN column holds the target system table name (e.g. PARTY_TYPES). The ETL then simply maps the data from the source system to the target system where the DOMAIN equals the target table name. This is an example of re-use generating a significant saving in the implementation.

30 This is a good use of a Warehouse Support Application as defined in “An Overview Architecture for Enterprise Data Warehouses”
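As a sketch of this mapping, assuming a support-application table called REFERENCE_TYPES with generically named columns (both names are assumptions for illustration), the ETL for each _TYPES table reduces to a single filtered insert:

-- Single reference table feeds every _TYPES table, filtered by DOMAIN.
INSERT INTO PARTY_TYPES
    (PARTYP_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
     PARTY_TYPE_START_DATE, PARTY_TYPE_END_DATE)
SELECT r.TYPE_DWK,
       r.TYPE_NAME,
       r.TYPE_DESC,
       r.TYPE_GROUP,
       r.TYPE_START_DATE,
       r.TYPE_END_DATE
FROM   REFERENCE_TYPES r
WHERE  r.DOMAIN = 'PARTY_TYPES';  -- the DOMAIN column selects the target table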


Band Tables

Whilst _TYPES tables classify information into discrete values it is sometimes necessary to classify information into ranges or bands i.e. between one value and another. The classic example of this is for telephone calls which are classified as ‘Off-Peak Rate’ if they are between 00:00 and 07:59 or between 18:00 and 23:59. Calls between 08:00 and 17:59 are classified as ‘Peak Rate’ and charged at a premium. _BANDS is a special case of the _TYPES table and would store the data as follows:

TIMBAN_DWK | TIME_BAND      | TIME_BAND_START_VALUE31 | TIME_BAND_END_VALUE | TIME_BAND_DESC | TIME_BAND_GROUP | TIME_BAND_START_DATE | TIME_BAND_END_DATE
1          | Early Off Peak | 0                       | 479                 | Early Off Peak | Off Peak        | 01-JAN-1900          |
2          | Peak           | 480                     | 1079                | Peak           | Peak            | 01-JAN-1900          |
3          | Late Off Peak  | 1080                    | 1439                | Late Off Peak  | Off Peak        | 01-JAN-1900          |

Figure 7 - Example data for TIME_BANDS

Once again the _BANDS table has a standard format, as follows:

• The table will have the same name as the major entity but with the suffix _BANDS (e.g. TIME_BANDS, etc.).
• The table will always have a key column that uses the six character short code and the _DWK suffix.
• The table will have a _BAND column that is the band name.
• The table will have a _START_VALUE column and an _END_VALUE column that represent the starting and finishing values of the band.
• The table will have a _DESC column that is a description of the band.
• The table will have a _GROUP column that groups certain bands together.
• The table will have a _START_DATE column and a _END_DATE column.

The table has to comply with this convention in order to be given the _BANDS suffix.

31 Note that values are stored as a number of minutes since midnight.
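The following query sketch shows how a band table is typically used. The CALLS table and its columns are assumptions, with CALL_MINUTES assumed to hold minutes since midnight to match the band values in Figure 7:

-- Classify each call as Peak or Off Peak via the band ranges.
SELECT c.CALL_ID,
       b.TIME_BAND_GROUP
FROM   CALLS c,
       TIME_BANDS b
WHERE  c.CALL_MINUTES BETWEEN b.TIME_BAND_START_VALUE
                          AND b.TIME_BAND_END_VALUE
-- bands are themselves time-variant, so filter on their validity dates
AND    c.CALL_DATE >= b.TIME_BAND_START_DATE
AND   (b.TIME_BAND_END_DATE IS NULL OR c.CALL_DATE <= b.TIME_BAND_END_DATE);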


Property Tables

In the discussion of major entities and lifetime value, the data that failed to meet the lifetime value principle was omitted from the major entity tables; however, it still needs to be stored. This is handled via a property table. Property tables also help to support the extensibility aspects of the data model.

Using PARTY as an example: as already identified, marital status does not possess lifetime value and therefore is not included in the major entity. Everyone starts as single; some marry, some divorce and some are widowed. These ‘status changes’ occur throughout the lifetime of the individual. To deal with this the property table can be modelled as follows:

Figure 8 - Party Properties Example

As can be seen from the example above, two new tables are created in order to handle the properties. The first is the PARTY_PROPERTIES table itself and the second is a supporting PARTY_PROPERTY_TYPES table. In order to store the marital status of an individual a set of data needs to be entered in the PARTY_PROPERTY_TYPES table:

TYPE        | GROUP
Single      | Marital Status
Married     | Marital Status
Divorced    | Marital Status
Co-Habiting | Marital Status

Figure 9 - Example Party Property Data

The description, start date and end date would be filled in appropriately. Note that the start and end dates here represent the start and end dates of the type, not of the individual’s use of that type.32 It is now possible to insert a row in the PARTY_PROPERTIES table that references the individual in the PARTY table and the appropriate PARTY_PROPERTY_TYPES row (e.g. ‘Married’). The PARTY_PROPERTIES table can also hold the start date and end date of this status and, optionally where appropriate, a text or numeric value that relates to that property.

32 The need for start and end dates on such items is often questioned however experience shows that legislation changes supposed static values in most countries over the lifetime of the data warehouse. For example in December 2005 the UK permitted a new type of relationship called a civil partnership. http://en.wikipedia.org/wiki/Civil_partnerships_in_the_United_Kingdom.
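A sketch of such an insert follows. The foreign key column names here use assumed short codes (PARTIE_DWK, PARPRT_DWK); the figures below label them PARTY_DWK and PARTY_PROPERTY_DWK and show text in place of the numeric keys for clarity:

-- Record that John Smith became 'Married' on 3 February 1990.
INSERT INTO PARTY_PROPERTIES
    (PARTIE_DWK, PARPRT_DWK, START_DATE, END_DATE)
SELECT p.PARTIE_DWK,          -- John Smith's row in PARTIES
       ppt.PARPRT_DWK,        -- the 'Married' row in PARTY_PROPERTY_TYPES
       DATE '1990-02-03',
       NULL                   -- open-ended until the status changes
FROM   PARTIES p,
       PARTY_PROPERTY_TYPES ppt
WHERE  p.FORENAME = 'John'
AND    p.SURNAME  = 'Smith'
AND    ppt.PARTY_PROPERTY_TYPE = 'Married';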


This means that not only the current marital status but also historical information can be stored.

PARTY_DWK33 | PARTY_PROPERTY_DWK | START_DATE  | END_DATE
John Smith  | Single             | 01-Jan-1970 | 02-Feb-1990
John Smith  | Married            | 03-Feb-1990 | 04-Mar-2000
John Smith  | Divorced           | 05-Mar-2000 | 06-Apr-2005
John Smith  | Co-Habiting        | 07-Apr-2005 |

Figure 10 - Example data for PARTY_PROPERTIES

The data shown here describes the complete history of an individual, with the last row showing the current state, as the START_DATE is before ‘today’ and the END_DATE is null. There is also nothing to prevent future information from being held: if John Smith announces that he is going to get married on a specific date in the future, then the current record can have its end date set appropriately and a new record added.

If another property is required (e.g. Number of Children) then no change is required to the data model. New rows are entered into the PARTY_PROPERTY_TYPES table:

TYPE   | GROUP
Male   | Number of Children
Female | Number of Children

Figure 11 - Example Data for PARTY_PROPERTY_TYPES

This allows data to be added to PARTY_PROPERTIES as follows:

PARTY_DWK  | PARTY_PROPERTY_DWK | START_DATE  | END_DATE    | VALUE
John Smith | Single             | 01-Jan-1970 | 02-Feb-1990 |
John Smith | Married            | 03-Feb-1990 | 04-Mar-2000 |
John Smith | Divorced           | 05-Mar-2000 | 06-Apr-2005 |
John Smith | Co-Habiting        | 07-Apr-2005 |             |
John Smith | Male               | 09-Jun-2001 |             | 1
John Smith | Female             | 10-Jul-2002 |             | 1

Figure 12 - Example Data for PARTY_PROPERTIES

In fact any number of new properties can be added to the tables as business processes and source systems change and new data requirements come about. The effect of this method, when compared to other methods of modelling this information, is to create very narrow (i.e. not many columns), long (i.e. many rows) tables instead of much wider, shorter tables.

The properties table is nevertheless very efficient. Firstly, unlike the example, the two _DWK columns are integers34, as are the start and end dates. Many of the _VALUE fields will be NULL, and those that are not will be predominately numeric rather than text values. The PARTY_PROPERTY_TYPE acts as a natural partitioning key in those databases that support table partitions. This method is very effective in terms of performance and storage in databases that use column or vector type storage.

33 Text from the related table is used in the _DWK columns rather than the numeric key, for clarity in these examples.
34 Integers are better than text strings for a number of reasons: they usually require less storage and there is less temptation to mix the requirements of identification and description (a problem clearly illustrated by car registration numbers in the UK). Keys are more reliable when implemented as integers because databases often have key generation mechanisms that deliver unique values. Integers do not suffer from upper/lower case ambiguities and can never contain special characters or ambiguities caused by different padding conventions (trailing spaces or leading zeros).
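For example, reconstructing a party’s marital status on a given date is a query along the following lines, re-using the assumed column names from the earlier sketches:

-- What was each John Smith's marital status on 30 June 1995?
SELECT p.FORENAME,
       p.SURNAME,
       ppt.PARTY_PROPERTY_TYPE AS MARITAL_STATUS
FROM   PARTIES p,
       PARTY_PROPERTIES pp,
       PARTY_PROPERTY_TYPES ppt
WHERE  pp.PARTIE_DWK = p.PARTIE_DWK
AND    pp.PARPRT_DWK = ppt.PARPRT_DWK
AND    ppt.PARTY_PROPERTY_TYPE_GROUP = 'Marital Status'
AND    pp.START_DATE <= DATE '1995-06-30'
AND   (pp.END_DATE IS NULL OR pp.END_DATE >= DATE '1995-06-30');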


The real saving in the number of rows, when compared to more conventional data model techniques that store duplicated rows for changed data, is normally less than expected. The example above has seven rows of data. The alternative approach of repeated sets of data requires six rows but considerably more storage because of the duplicated data:

PARTY_DWK     START_DATE     END_DATE       MARITAL_STATUS    CHILD_UNKNOWN    CHILD_MALE    CHILD_FEMALE
John Smith    01-Jan-1970    02-Feb-1990    Single            0                0             0
John Smith    03-Feb-1990    04-Mar-2000    Married           0                0             0
John Smith    05-Mar-2000    08-Jun-2001    Divorced          0                0             0
John Smith    09-Jun-2001    09-Jul-2002    Divorced          0                1             0
John Smith    10-Jul-2002    06-Apr-2005    Divorced          0                1             1
John Smith    07-Apr-2005                   Co-Habiting       0                1             1

Figure 13 - Example Data for PARTY_PROPERTIES

The other main objection to this technique is often described as the cost of matrix transformation of the data: that is, changing the data from columns into rows in the ETL that loads the data warehouse, and then changing the rows back into columns in the ETL that loads the data mart(s). This objection is normally due to a lack of knowledge of appropriate ETL techniques that can make this very efficient, such as using the SQL set operators 'UNION', 'MINUS' and 'INTERSECT'.
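To illustrate the data mart half of that transformation, a minimal sketch that pivots the narrow property rows back into a wide dimension row; it assumes the tables above, a TYPE_GROUP column on the type table ('Marital Status' is an assumed group name) and that only current rows (null END_DATE) are wanted:

    SELECT pp.PARTY_DWK,
           MAX(CASE WHEN pt.TYPE_GROUP = 'Marital Status'
                    THEN pt.TYPE END)          AS MARITAL_STATUS,
           MAX(CASE WHEN pt.TYPE = 'Male'
                    THEN pp.NUMERIC_VALUE END) AS CHILD_MALE,
           MAX(CASE WHEN pt.TYPE = 'Female'
                    THEN pp.NUMERIC_VALUE END) AS CHILD_FEMALE
      FROM PARTY_PROPERTIES pp
      JOIN PARTY_PROPERTY_TYPES pt
        ON pt.PARTY_PROPERTY_TYPE_DWK = pp.PARTY_PROPERTY_TYPE_DWK
     WHERE pp.END_DATE IS NULL                 -- current state only
     GROUP BY pp.PARTY_DWK;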

Event Tables

An event table is almost identical to a property table except that, instead of having _START_DATE and _END_DATE columns, it has a single _EVENT_DATE column. It also has the appropriate _EVENT_TYPES table, and the table name has a suffix of _EVENTS. For example a wedding is an event (it happens at a single point in time), but 'being married' is a property (it happens over a period of time). Events can be stored in property tables simply by storing the same value in both the start date and end date columns, and this is a more common solution than creating a separate table. The use of _EVENTS tables is usually limited to places where events form a significant part of the data and the cost of storing the extra field becomes significant. It should be noted that this is only required where the event may occur many times (e.g. a wedding date) rather than for information that can only happen once (e.g. first wedding date), which would be stored in the appropriate major entity as, once set, it would have lifetime value.

Figure 14 - Party Events Example

_EVENTS tables are a special case of _PROPERTIES tables.


Link Tables

Up to this point attributes within a single record of a major entity have been examined. It is also possible that records within a major entity will relate to other records in the same major entity (e.g. John Smith is married to Jane Smith, both of whom are records within the PARTIES table). This is called a peer-to-peer relationship and is stored in a table with the suffix _LINKS and the appropriate _LINK_TYPES table.

Figure 15 - Party Links Example

The significant difference in a _LINK table is that there are two relationships from the major entity (in this case PARTIES). This also allows hierarchies to be stored so that:

John Smith (Individual) works in Sales (Organisational Unit) Sales (Organisation Unit) is a division of ACME Enterprises (Organisation)

where 'works in' and 'is a division of' are examples of the _LINK_TYPE. It should also be noted that there is a priority to the relationship because one of the linking fields is the main key (in this case PARTY_DWK) and the other is the linked key (in this case LINKED_PARTY_DWK). There are two options; one is to store the relationship in both directions (e.g. John Smith is married to Jane Smith and Jane Smith is married to John Smith). This can be made complete with a reversing view35 but defeats both the 'Convention over Configuration' principle and the 'DRY (Don't Repeat Yourself)' principle. The second method is to have a convention and only store the relationship in one direction (e.g. John Smith is married to Jane Smith, therefore the convention could be that the male is stored in the main key and the female in the linked key).

35 A reversing view is one that has all the same columns as the underlying table except that the two key columns are swapped around. In this example PARTY_DWK would be swapped with LINKED_PARTY_DWK.
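A minimal sketch of such a reversing view, assuming the PARTY_LINKS columns described above:

    CREATE VIEW PARTY_LINKS_REVERSED AS
    SELECT LINKED_PARTY_DWK AS PARTY_DWK,        -- the two key columns are swapped
           PARTY_DWK        AS LINKED_PARTY_DWK,
           PARTY_LINK_TYPE_DWK,                  -- all other columns pass through unchanged
           START_DATE,
           END_DATE
      FROM PARTY_LINKS;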


Segment Tables

The final type of information that might be required about a major entity is the segment. This is a collection of records from the major entity that share something in common where more detail is not known. The most common business example would be the market segmentation done on customers. These segments are normally the result of detailed statistical analysis, with the results then stored. In our example John Smith and Jane Smith could both be part of a segment of married people, along with any number of other individuals who are known to be married but for whom there is no information about when or to whom they are married. Where the _LINKS table provided the peer-to-peer relationship, the segment provides the peer group relationship.

Figure 16 - Party Segments Example


The Sub-Model

The major entities and the six supporting data structures (_TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS, and _SEGMENTS) provide sufficient design pattern structure to hold a large part of the information in the data warehouse. The set of a major entity and its supporting structures is known as a major entity sub-model. Significantly, the information stored for a single major entity sub-model is very close to the typical dimensions of a data mart. This design pattern provides complete temporal support and the ability to re-construct a dimension or dimensions based on a given set of business rules. For example the designed PARTY sub-model consists of:

• PARTIES
• PARTY_TYPES
• PARTY_BANDS
• PARTY_PROPERTIES
• PARTY_PROPERTY_TYPES
• PARTY_EVENTS
• PARTY_EVENT_TYPES
• PARTY_LINKS
• PARTY_LINK_TYPES
• PARTY_SEGMENTS
• PARTY_SEGMENT_TYPES

Those tables in bold italics might represent the implemented PARTY sub-model. Importantly, what has not yet been provided is the relationships between major entities and the business transactions that occur as a result of the interaction between major entities.


History Tables

Extending the example above, it is noticeable that the party does not contain any address information; this is held in the geography major entity. This is also another example where current business processes and requirements may change. At the outset the source system may provide a contract address and a billing address; a change in process may require the capture of additional information, e.g. contact addresses and installation addresses. In practice the only difference between this type of relationship between major entities and the _LINKS relationship is that instead of two references to the same major entity there is one reference to each of two major entities. The data model is therefore relatively simple to construct:

Figure 17 – Party Geography History Example

There is one minor semantic difference between links and histories. _LINKS tables join back onto the major entity and therefore one half of the relationship has to be given priority. In a _HISTORY table there is no need for priority as each of the two attributes is associated with a different major entity. Finally note that in this example the major entity is shown without the rest of the sub-model, which can be assumed.
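A minimal sketch of the history table in ANSI SQL; the table and column names are assumptions that follow the naming standards in Appendix 1 (the type key is abbreviated here to respect the 30-character limit), with the type table qualifying the kind of address, e.g. billing or installation:

    CREATE TABLE PARTY_GEOGRAPHY_HISTORY (
        PARTY_DWK                 INTEGER NOT NULL,  -- one major entity
        GEOGRAPHY_DWK             INTEGER NOT NULL,  -- the other major entity
        PARTY_GEOGRAPHY_TYPE_DWK  INTEGER NOT NULL,  -- e.g. 'Billing Address'
        START_DATE                DATE,
        END_DATE                  DATE,              -- null while the relationship holds
        FOREIGN KEY (PARTY_DWK)     REFERENCES PARTIES (PARTY_DWK),
        FOREIGN KEY (GEOGRAPHY_DWK) REFERENCES GEOGRAPHIES (GEOGRAPHY_DWK)
    );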


Occurrences and Transactions

The final part of the data model is to build up the occurrence or transaction tables. In the data mart these are most akin to the fact tables, although as this is a relational model they may occur outside a pure star relationship. Like the major entities there is no standard suffix or prefix, just a meaningful name. To demonstrate what is required an example from a retail bank is described. The example is not nearly as complex as a real bank, but is necessarily longer and more complex than most examples in order to demonstrate a number of features. Furthermore banking has been chosen as an example because the concepts will be familiar to most readers. The example only looks at some core banking functions and not at activities such as marketing or specialist products such as insurance.

The Example

The bank has a number of regions and a central 'premium' account function that caters for some business customers. Each region has a number of branches. Branches have a manager and a number of staff. Each branch manager reports to a regional manager.

If a customer has a personal account then the account manager is a branch personal account manager; however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, addresses, etc.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account and contact details for them.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies and, if they are likely to move band in the coming year, they are added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc. The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions. The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.


After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

On a daily basis the exposure (i.e. the sum of all account balances) is calculated for each customer, along with a risk factor: a number between 0 and 100 that is influenced by a number of factors reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers' decisions.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

De-constructing the example

The bank has a number of regions and a central ‘premium’ account function that caters for some business customers. Each region has a number of branches. Branches have a manager. Each branch manager reports to a regional manager.

• The bank itself must be held as an organisation.
• The regions and the central 'premium' account function are held as organisation units.36
• The bank and the regions have links.
• The branches are held as organisation units.
• The regions and the branches have links.
• The branches have addresses via a history table.
• The branches have electronic addresses via a history table.
• There are a number of roles stored as organisation units.
• The roles and the individuals have links.
• The roles may have addresses via a history table.
• The roles may have electronic addresses via a history table.
• The individuals may have addresses via a history table.
• The individuals have electronic addresses via a history table.

At this point only existing major entities and history tables have been used. This information would also be re-usable in many places, just like the conformed dimensions concept of star schemas but with more flexibility.

If a customer has a personal account then the account manager is a branch personal account manager, however if the individual has a net worth in excess of USD1M the branch manager acts as the account manager. Personal accounts have contact and statement addresses and a range of telephone numbers, e-mails, etc.

• Customers are held as Parties, either Individuals or Organisations.
• Customers have addresses via a history table.
• Customers have electronic addresses via a history table.
• Accounts are held in the Accounts major entity.
• Customers are related to accounts via a history table.
• Branches are related to accounts via a history table.
• Accounts are associated with a role via a history table.
• An individual's net worth is generated elsewhere and stored as a property of the party.

36 See Appendix 2 – Understanding Hierarchies for an explanation as to why the regions are organisational units and not geography.


• A high net worth individual is a member of a similarly named segment.
• The accounts may have addresses via a history table.
• The accounts may have electronic addresses via a history table.

If the account belongs to a business with less than USD1M turnover then the account manager is a business account manager at the branch who reports to the branch manager. If the account belongs to a business with a turnover of between USD1M and USD10M then the account manager is an individual at the regional office who reports to the regional manager. If the account belongs to a business with a turnover of more than USD10M then the account managers at the central office are responsible for the account. Businesses have contact and statement addresses as well as a number of approved individuals who can use the company account, and contact details for them.

• Businesses are held as parties.
• The business turnover is held as a party property.
• The category membership based on turnover is held as a segment.
• The businesses may have addresses via a history table.
• The businesses may have electronic addresses via a history table.

Branch and account managers periodically review the banding of accounts by income for individuals and turnover for companies and, if they are likely to move band in the coming year, they are added to the appropriate (future) category. Note that this is only partially fact based, the rest being based on subjective input from account managers.

• There is a need to allow manual input via a warehouse support application for the party segments.

At this point only the PARTY, ADDRESS and ELECTRONIC ADDRESS sub-models and associated _HISTORY tables have been used.

The bank offers a range of services including current, loan and deposit accounts, credit and debit cards, EPOS (for business accounts only), foreign exchange, etc.

• The product services are held in the product service major entity.
• The product services are associated with an account via a history table.

The bank has a number of channels including branches, a call centre service, a web service and the ability to use ATMs for certain transactions.

• The channels are held in the channels major entity.
• The ability to use a channel for a specific product service is held in the history that relates the two major entities.

This adds the PRODUCT_SERVICE and CHANNEL major entities into the model.

The bank offers a range of transaction types including cash, cheque, standing order, direct debit, interest, service charges, etc.

• This requires a TRANSACTION_TYPES table that will be referenced by the transaction table, which has not yet been defined.

After the close of business on the last working day of each month the starting and ending balances, the average daily balance and any interest are calculated for each account.

• This is stored as an account property (it may be an event).


On a daily basis the exposure (i.e. sum of all account balances) is calculated for each customer along with a risk factor that is a number between 0 and 100 that is influenced by a number of factors that are reviewed from time to time by the risk management department. Risk factors might include sudden large deposits or withdrawals, closure of a number of accounts, long-term non-use of an account, etc. that might influence account managers’ decisions.

• The exposure is stored as a party property (or event).
• The party risk factor is stored as a party property.

Everything that is required to describe the transaction table is now available.

Every transaction that is made is recorded every day and has three associated dates: the date of the transaction, the date it appeared on the system and the cleared date.

• The Transaction Table will have the following columns:
  o Transaction Date
  o Transaction System Date
  o Transaction Cleared Date
  o From Account
  o To Account
  o Transaction Type
  o Amount
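A minimal sketch of that table in ANSI SQL; the physical names are assumptions, with the two account columns both keyed into the ACCOUNTS sub-model:

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE          DATE    NOT NULL,  -- when the transaction took place
        TRANSACTION_SYSTEM_DATE   DATE    NOT NULL,  -- when it appeared on the system
        TRANSACTION_CLEARED_DATE  DATE,              -- null until the transaction clears
        FROM_ACCOUNT_DWK          INTEGER NOT NULL,  -- the sending account
        TO_ACCOUNT_DWK            INTEGER NOT NULL,  -- the receiving account
        TRANSACTION_TYPE_DWK      INTEGER NOT NULL,  -- cash, cheque, interest, etc.
        AMOUNT                    NUMERIC(18,2) NOT NULL,  -- always positive, see below
        FOREIGN KEY (FROM_ACCOUNT_DWK) REFERENCES ACCOUNTS (ACCOUNT_DWK),
        FOREIGN KEY (TO_ACCOUNT_DWK)   REFERENCES ACCOUNTS (ACCOUNT_DWK),
        FOREIGN KEY (TRANSACTION_TYPE_DWK)
            REFERENCES TRANSACTION_TYPES (TRANSACTION_TYPE_DWK)
    );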

This would complete the model for the example. There are some interesting features to examine. The first is that all amounts would be positive. This is because for a credit to an account the 'from account' would be the sending party and the 'to account' would be the customer's account. For a debit the 'to account' would be the recipient and the 'from account' would be the customer's account.

This has a number of effects. Firstly it complies with the DRY (Don't Repeat Yourself) principle and means that extra data is not stored for the transaction. It also means that a collection of account information not related to any current party (e.g. a customer at another bank) is built up. This information is useful in the analysis of fraud, churn, market share, competitive analysis, etc. For a customer analysis data mart the data can be extracted and converted into the positive credit/negative debit arrangement required by the users. The payment of bank charges and interest would also have accounts, and this information in a different data mart could be used to look at profitability, exposure, etc.

The process has used seven major entities' sub-models, an additional type table and an occurrence or transaction table. Storing this information should accommodate and absorb almost any change in business process or source system without the need to change the data warehouse model, and will allow multiple data marts to be built from a single data warehouse quickly and easily. In effect the type tables act as metadata for how to use and extend the data model rather than defining the business process explicitly in the data model, hence the name process neutral data modelling.

It also demonstrates the ability of the data model to support the requirements process. By knowing the major entities and using a storyboard approach similar to the example above, familiar as an approach to agile developers, it is possible to quickly and easily identify business, data and query requirements.


[Figure 18 shows the example bank data model: a Party sub-model (Individuals, Organisations, Organisation Units, Roles), an Addresses sub-model (Postal Address, Point Location), an Electronic Addresses sub-model (Telephone Numbers, E-Mail Addresses, Telex), an Accounts sub-model, a Product Service sub-model, a Channel sub-model and a Calendar sub-model, joined to one another through History tables, with the Retail Banking Transactions table and its Transaction Types table at the centre.]

Figure 18 - The Example Bank Data Model


The model above has been almost fully described in detail by this document, since the self-similar modelling for all the sub-model components has been described along with the history tables, most of the retail banking transactions and some of the lifetime attributes of the major entities. Completing the model just requires these additional attributes to be added.

Two other effects that will influence the creation of data marts from this model can also be seen. Firstly, the creation of dimensions will revolve around the de-normalisation of the attributes that are required from each of the major entities into one of the two dimensions associated with account, as these have the hierarchies for the customer, account manager, etc. associated with them. The second effect is that of the natural star schema. It is clear from this diagram that the fact tables will be based around the 'Retail Banking Transactions' table. As has already been stated, there are several data marts that can be built from this fact table, probably at different levels of aggregation and with different dimensions.

The occurrence or transaction table above is one of perhaps twenty that a large enterprise would require, along with approximately thirty _HISTORY tables. This would be combined with around twenty major entity sub-models to create an enterprise data warehouse data model.

Readers familiar with the Data Management & Warehousing white paper 'How Data Works'37, which describes natural star schemas in more detail along with a technique called left to right entity diagrams, will see a correlation as follows:

Level   Description
1       _TYPE and _BAND tables; simple, small volume reference data.
2       Major entities; complex, low volume data.
3       Some major entities that are dependent on others, along with _PROPERTIES and _SEGMENTS tables; less complex but with greater volume.
4       _HISTORY tables and some occurrence or transaction tables.
5       Occurrence or transaction tables; significant volume but low complexity data.

Figure 19 - Volume & Complexity Correlations

37 Available for download from http://www.datamgmt.com/whitepapers


Implementation Issues

The use of a process neutral data model and a design pattern is meant to ease the design of a system, but there will always be exceptions and things that need further explanation in order to fit them into the solution. Much of this section refers to ETL issues that can only be briefly described in this context.38

The ‘Party’ Special Case

The examples throughout this document have used the PARTY table as a major entity, but in practice this is one of the more difficult tables to deal with. The first issue is that in many cases a name does not have lifetime value, for example when a woman gets married or divorced and changes her name, or when a company renames itself.39 Individual names also often have multiple parts (title, forename, surname).

There is also a requirement to track some form of state identity number. In the United Kingdom an individual has their National Insurance number and in the United States their Social Security number; other numbers (e.g. passport and ID card numbers) are simply stored as properties. Organisations have other numbers (companies have registration numbers, charities and trusts have different registration numbers, but VAT numbers are properties as they can and do change).

Another minor issue is that people have a date of birth and a date of death. This is simply resolved: date of birth is the individual start date and date of death is the individual end date, however this terminology can sometimes prove controversial.

The solution to the PARTY special case depends on the database technology being used. If the database supports the creation of views and the 'UNION ALL' SQL operator then the preferred solution is as follows:40

Create the INDIVIDUALS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• TITLE
• FORENAME
• CURRENT_SURNAME41
• PREVIOUS_SURNAME
• MAIDEN_SURNAME
• DATE_OF_BIRTH
• DATE_OF_DEATH
• STATE_ID_NUMBER
• Other lifetime attributes as required

38 Data Management & Warehousing provide consultancy on ETL design and techniques to ensure that data warehouses can be loaded effectively regardless of the data modelling approach used.

39 Interestingly, in Scotland, which has different regulations from England & Wales, birth, marriage and death certificates (also known as vital records) have, since 1855, recognised the importance of knowing the birth names of everyone on the certificate. For example a wedding certificate will give the groom's mother's maiden name, and a married woman's death certificate will also feature her maiden name. Effectively the birth name has lifetime value and all other names are additional information. http://www.scotlandspeople.gov.uk/content/help/index.aspx?r=554&628

40 Nearly all business intelligence databases support this functionality.

41 CURRENT_ and PREVIOUS_ are reserved prefixes; see Appendix 1 - Data Modelling Standards.


Create the ORGANISATIONS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_NAME
• PREVIOUS_ORGANISATION_NAME
• START_DATE
• END_DATE
• STATE_ID_NUMBER
• Other lifetime attributes as required

Create the ORGANISATION_UNITS table as follows:

• PARTY_DWK
• PARTY_TYPE_DWK
• CURRENT_ORGANISATION_UNIT_NAME
• PREVIOUS_ORGANISATION_UNIT_NAME
• START_DATE
• END_DATE
• Other lifetime attributes as required

This can then be mapped to a view called PARTIES as follows:

PARTIES            INDIVIDUALS                    ORGANISATIONS                  ORGANISATION_UNITS
PARTY_DWK          PARTY_DWK                      PARTY_DWK                      PARTY_DWK
PARTY_TYPE_DWK     PARTY_TYPE_DWK                 PARTY_TYPE_DWK                 PARTY_TYPE_DWK
CURRENT_NAME       FORENAME + CURRENT_SURNAME     CURRENT_ORGANISATION_NAME      CURRENT_ORGANISATION_UNIT_NAME
PREVIOUS_NAME      FORENAME + PREVIOUS_SURNAME    PREVIOUS_ORGANISATION_NAME     PREVIOUS_ORGANISATION_UNIT_NAME
START_DATE         DATE_OF_BIRTH                  START_DATE                     START_DATE
END_DATE           DATE_OF_DEATH                  END_DATE                       END_DATE
STATE_ID_NUMBER    STATE_ID_NUMBER                STATE_ID_NUMBER                Null

Figure 20 - PARTIES view mapping

It should be noted that:

• The PARTY_DWK must be unique across all the tables.
• The PARTY_TYPE_DWK will be a single value in the INDIVIDUALS table.
• The ORGANISATION_UNITS STATE_ID_NUMBER will be null in the view.
• The '+' sign represents concatenation and should include a space between words.
• Other attributes can be included in the view as deemed appropriate.
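A minimal sketch of the view in ANSI SQL, using '||' for the concatenation shown as '+' in the mapping and assuming STATE_ID_NUMBER is a VARCHAR(32):

    CREATE VIEW PARTIES AS
    SELECT PARTY_DWK,
           PARTY_TYPE_DWK,
           FORENAME || ' ' || CURRENT_SURNAME  AS CURRENT_NAME,
           FORENAME || ' ' || PREVIOUS_SURNAME AS PREVIOUS_NAME,
           DATE_OF_BIRTH                       AS START_DATE,
           DATE_OF_DEATH                       AS END_DATE,
           STATE_ID_NUMBER
      FROM INDIVIDUALS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_NAME, PREVIOUS_ORGANISATION_NAME,
           START_DATE, END_DATE, STATE_ID_NUMBER
      FROM ORGANISATIONS
    UNION ALL
    SELECT PARTY_DWK, PARTY_TYPE_DWK,
           CURRENT_ORGANISATION_UNIT_NAME, PREVIOUS_ORGANISATION_UNIT_NAME,
           START_DATE, END_DATE,
           CAST(NULL AS VARCHAR(32))           -- organisation units have no state ID
      FROM ORGANISATION_UNITS;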

Where possible it is often beneficial to create this as a materialized view so that it can be indexed and provide a primary key for the other tables to reference.

Whilst the PARTIES table needs all these techniques they can also be used in part on other major entities if required.

The alternate strategy where UNION ALL views are not available is to create a single table including all the columns and use those columns that are appropriate as required by the query.


Partitioning

The Party special case is an example of vertical partitioning, i.e. tables that are split based on the different columns required for the different types. Queries require a view across the information in order to be able to access all of it.

Figure 21 - Vertically Partitioned Data
[The figure shows three partitions, each pairing the common columns with the columns specific to Individuals, Organisations and Organisation Units respectively.]

Tables can also be horizontally partitioned, i.e. whilst the table structure remains the same the table is split on some data item that changes, most commonly the date. This sometimes requires a view to be able to access all the information, but is more commonly implemented in the database architecture itself.

Figure 22 - Horizontally Partitioned Data
[The figure shows the same table structure split into partitions holding the common data for January, February and March respectively.]

If both horizontal and vertical partitioning are used together this is known as matrix partitioning. This is uncommon.

In Process Neutral Data Modelling, as a consequence of the approach, vertical partitioning, if required, usually occurs on tables with a _TYPE and uses the _TYPE as the partitioning key. Horizontal partitioning happens almost exclusively on the transaction tables and should be based on the _START_DATE, which has lifetime value and is not updated (unlike the _END_DATE, which is updated). Horizontal partitioning is not effective, and often not supported, on MPP platforms that hash the data internally to multiple nodes. Column or vector storage databases render horizontal partitions meaningless as a storage strategy.
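Partition syntax is vendor-specific; purely as an illustration, an Oracle-style sketch of horizontal range partitioning of a transaction table by date (partition names and boundaries are assumptions):

    CREATE TABLE RETAIL_BANKING_TRANSACTIONS (
        TRANSACTION_DATE  DATE NOT NULL,
        AMOUNT            NUMERIC(18,2)
        -- other columns as previously described
    )
    PARTITION BY RANGE (TRANSACTION_DATE) (
        PARTITION P200901 VALUES LESS THAN (DATE '2009-02-01'),  -- January 2009
        PARTITION P200902 VALUES LESS THAN (DATE '2009-03-01'),  -- February 2009
        PARTITION P200903 VALUES LESS THAN (DATE '2009-04-01')   -- March 2009
    );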


Data Cleansing

Data cleansing itself is outside the scope of this document, however the model must make allowances for it. In particular, if data is to be cleaned or standardised then the original data must also be stored. To this end every column that is to be modified in this way should have an additional column with the prefix STANDARDIZED_ added to it. For example there may be a column in a table called NAME that has 'Fred Bloggs' stored with mixed case, two spaces between the words and a trailing space. The cleaning routine would replace multiple white spaces with a single space character, then remove leading and trailing white space before converting the text to uppercase, producing 'FRED BLOGGS'. The result would be stored in the column STANDARDIZED_NAME, leaving the original data in NAME. This technique should be used wherever data cleansing takes place. If this column is created then it must always be populated, even if the individual row has not changed. This is because the fact that there is no change is information in itself, and also to avoid the need on load and extraction to determine whether the original or cleansed data should be used.
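A minimal sketch of such a cleaning rule, assuming a staging table with NAME and STANDARDIZED_NAME columns and a platform that offers a regular-expression replace function (the collapsing of internal white space is not expressible in pure SQL92):

    UPDATE STG_PARTIES
       SET STANDARDIZED_NAME =
           UPPER(                                    -- convert to upper case
               TRIM(                                 -- strip leading/trailing space
                   REGEXP_REPLACE(NAME, ' +', ' ')   -- collapse runs of spaces
               )
           );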

Null Values

Process neutral data modelling does not require many nulls in the database at all, and they should be avoided wherever possible. All _END_DATE columns must allow nulls. Some _START_DATE columns will need to allow nulls. The _VALUE columns must also allow nulls. Other than these cases, the principle of lifetime value should ensure the data model requires few other columns that allow a null value.

Indexing Strategy

The data warehouse should only be indexed where necessary, i.e. primary and foreign keys and one or two other essential columns needed for good performance when extracting information into data marts. Users do not query the data warehouse directly, and therefore the indexes should be aimed at ensuring that the ETL is as effective as possible.

Enforcing Referential Integrity

Data warehouse projects often have long debates on whether referential integrity should be enforced in the data warehouse. The discussion centres on the cost of inserts and updates when referential integrity is enabled, which slows down the load of the data warehouse. Whilst on the face of it removing referential integrity is an attractive proposition, the question has to be where the cost of doing so lands, because nothing is free. The cost comes in two places: firstly in the extra code required to ensure referential integrity outside the database, which has to be built into the process before loading, and secondly in the cost of handling the data quality issues when something is missed. Where a holistic view of the data warehouse processing is taken, regardless of the data modelling technique used, it becomes apparent that disabling referential integrity is more expensive than enabling it and designing processes to accommodate it. Process neutral data models should always have referential integrity enabled unless there is a specific case for individual tables that means it cannot be done.


There is also an approach of disabling referential integrity, loading data and then re-enabling referential integrity. This is acceptable as long as any issues are resolved, but in practice many systems ignore issues and ultimately this affects the quality, and therefore the longevity, of the system. Finally it is possible to write ETL that always complies with referential integrity even when there is missing data, using a technique called 'defensive programming'. For example, if a type is missing from a _TYPE table it is possible to write the value into the _TYPE table before inserting the data into the main table. Doing so will create a row in the _TYPE table where the description, group, etc. are set to 'Unknown'. This allows all data to be processed and data quality metrics to be run ('How much of my reference data is unknown?'), provides early warning of unplanned changes in the source system and allows users, via the data maintenance application, to fix reference data in a timely fashion without impacting the load process.
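A minimal sketch of that defensive step, assuming a staging table carrying the incoming type descriptions; key generation is platform-specific and an Oracle-style sequence is shown:

    -- insert any incoming types not already present, with 'Unknown' descriptions
    -- that users can later correct via the data maintenance application
    INSERT INTO PARTY_PROPERTY_TYPES
        (PARTY_PROPERTY_TYPE_DWK, TYPE, TYPE_GROUP, DESCRIPTION)
    SELECT PARTY_PROPERTY_TYPE_SEQ.NEXTVAL,
           s.TYPE, 'Unknown', 'Unknown'
      FROM (SELECT DISTINCT TYPE FROM STG_PARTY_PROPERTIES) s
     WHERE NOT EXISTS (SELECT 1
                         FROM PARTY_PROPERTY_TYPES t
                        WHERE t.TYPE = s.TYPE);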

Data Insert versus Data Update

The process neutral data model requires very few updates, the notable exception being the _END_DATE column. This is useful for database platforms that perform better with fewer updates, such as the MPP appliance platforms and the column storage/vector platforms. On traditional database platforms where insert and update have equal cost, the fact that one method is preferred is of no consequence. Where the data warehouse platform favours inserts, it is preferable that the processing of the data, and any staging that is update intensive, is performed in the ETL tool or a dedicated staging database (depending on the architectural constraints and platform choices made by the organisation) outside the data warehouse.

Row versus Set Based Loading in ETL

Most ETL is written to perform Row Based Loading, i.e.

    For each row
        For each column
            Process Data
        Next column
    Next row

This technique is common because it uses procedural language techniques familiar to most developers, and ETL tools provide procedural language interfaces. However relational databases were written with set theory in mind and have high performance set operator commands such as UNION, MINUS and INTERSECT that can greatly reduce the processing needed to load any data model, but especially process neutral data models. A description of the basic principles of set processing can be found in Appendix 5.
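A minimal sketch of the set-based alternative, assuming staging and target tables with matching column lists; rows present in staging but not yet in the target are found with a single set operation rather than a row-by-row comparison (MINUS is the Oracle spelling of the ANSI EXCEPT operator):

    INSERT INTO PARTY_PROPERTIES
        (PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, START_DATE, END_DATE, NUMERIC_VALUE)
    SELECT PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, START_DATE, END_DATE, NUMERIC_VALUE
      FROM STG_PARTY_PROPERTIES
    MINUS                                      -- keep only rows not already loaded
    SELECT PARTY_DWK, PARTY_PROPERTY_TYPE_DWK, START_DATE, END_DATE, NUMERIC_VALUE
      FROM PARTY_PROPERTIES;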


Disk Space Utilisation

Using this approach means that initially more disk space is used, when the reference data and the first values of the column-based property structures are created, than with other modelling techniques. Since a change in a property only affects the individual cell rather than the entire row, each change uses less space as the data warehouse grows, and therefore the total disk space used drops below that used by the other techniques. Over the lifetime of the data warehouse it is unlikely that either approach will see significant cost or significant savings in disk space.

Implementation Effort

The method chosen for the data modelling can have a significant effect on the effort involved in building a data warehouse. Process Neutral Data Modelling typically has the following characteristics when compared to more traditional approaches:

• Simpler and quicker requirements gathering
The major entities, and therefore the frame of reference, can exist before the detailed data requirements, making it possible to use them as a communication tool to aid the gathering of requirements.

• Quick data warehouse and data mart design
The data warehouse model and data marts using natural star schemas are quickly drawn out of the modelling technique.

• Easy build and configuration of reference data management applications
This effect is a result of the self-similar modelling, which significantly reduces the build effort.

• Longer initial build cycles on the ETL
It takes time to develop optimum algorithms for performance and reuse based on a site-specific set of tools and platforms.

• Shorter later build cycles on the ETL
The time taken to do later cycles is hugely reduced because of the reuse designed in during the earlier stages.

• Reduced maintenance costs
The long term maintenance cost of data warehouses is rarely measured, however the ability of this technique to allow rapid change ensures that maintenance costs will be significantly lower than with other approaches.

• Simple database sizing
The self-similar nature of the data model makes it easy to size the database. All _TYPE and _BAND tables can be ignored for sizing purposes. It is possible to work out a ratio for the number of rows between the major entity and its _PROPERTIES table, the column widths are fixed, etc. This greatly reduces the DBA overhead.


Data Commutativity

In mathematics there is a concept of commutativity42: the ability to change the order of something without changing the end result. For example 2 + 1 is the same as 1 + 2 and is therefore commutative, however 2 - 1 is not the same as 1 - 2 and is therefore not commutative. In general data is not commutative, however it is hierarchical and can therefore be derived in one direction. A common question asked about process neutral data modelling is that, with so many places that data can be held, which is the right place? The answer is simple: data should be held at the most detailed level possible.

Figure 23 - Data Commutativity

Since data will be extracted into data marts, the ETL that performs the extraction should consolidate the information to the appropriate level for that data mart. It is important to note that this is also a core part of the change management process for the data warehouse. For example, the initial system that is used as a source collects data at the segment level. A new system is commissioned to replace the initial source; the new system collects data at the link level. The data can immediately be loaded at the link level and then extracted to the data mart at the segment level. Over time the initial system is de-commissioned and all the information is gathered at the link level. At this point the data warehouse can be updated to supply the data mart with information at the link, property or event level as required. It is against the DRY (Don't Repeat Yourself) principle to store the derived data at every level in the data warehouse unless there is some specific added value that is provided by doing so.

42 http://en.wikipedia.org/wiki/Commutative

Links: Detailed knowledge of the relationship e.g. John Smith was married to Jane Smith between 01-Jan-2000 and 01-Jul-2005

Properties: Detailed knowledge of part of the relationship e.g. John Smith was married between 01-Jan-2000 and 01-Jul-2005

Events: Less detailed knowledge of part of the relationship e.g. John Smith’s wedding was on 01-Jan-2000

Segments: Minimal knowledge of the relationship e.g. John Smith was married at some point in time

[Arrows in the figure indicate that data can be derived in the direction from links down to segments, but cannot be derived in the opposite direction.]
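As an illustration of deriving downwards, a minimal sketch (the mart table and the 'Married To' type value are assumptions) that builds segment-level membership from link-level data during a data mart extract; the UNION covers both ends of the link because the convention stores the relationship in one direction only:

    INSERT INTO MART_MARRIED_SEGMENT (PARTY_DWK)
    SELECT l.PARTY_DWK
      FROM PARTY_LINKS l
      JOIN PARTY_LINK_TYPES t
        ON t.PARTY_LINK_TYPE_DWK = l.PARTY_LINK_TYPE_DWK
     WHERE t.TYPE = 'Married To'
    UNION                                      -- also pick up the linked party
    SELECT l.LINKED_PARTY_DWK
      FROM PARTY_LINKS l
      JOIN PARTY_LINK_TYPES t
        ON t.PARTY_LINK_TYPE_DWK = l.PARTY_LINK_TYPE_DWK
     WHERE t.TYPE = 'Married To';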


Data Model Explosion and Compression

Two commonly asked questions are:

• How big will this data model get, especially if every major entity has ten supporting tables?43

• Can’t all the type and band tables be put in a single table; actually can we merge all the properties, events, links and segments together into a single table too?

Before answering these two specific questions it is important to make some observations about the process of data modelling. The objective of a data model is to create a clear, structured environment in which to store data. Every data modeller will have their own preferences for the way in which they design the data model. Process neutral data models strive to find a balance between:

• Re-use of design patterns that provide consistency of information and algorithm
• Clarity of model that aids understanding
• Size of model that affects maintainability
• Performance of the system that affects usability

It is possible for data modellers to change the rules that they apply to the data model however before doing so the data modeller should understand the effect on the overall balance of the system. Projects inevitably fail when the balance is lost and one of these aspects overrides all the others. Projects should always enforce a single data modelling style.

How big does the data model get?

Since the approach uses a design pattern it is possible to design tables and not use them. Experience shows that about fifteen to twenty major entities will be needed with, on average, five supporting tables. This combined with about fifty history and occurrence or transaction tables means the data model will be around one hundred and fifty tables in total. This compares very favourably with other data warehouse models. Large data warehouses that have been in production usually exceed this number. Smaller and newer data warehouses often start with fewer but quickly grow to this sort of size. The advantage of this approach is that nearly everything that comes along in the future has already been designed into the solution, therefore there is no long-term data model size increase provided the model is properly managed.

Can the data model be compressed?

It is possible from an implementation point of view to reduce the number of tables that are implemented by combining, for example, all the _TYPE tables into a single table, but this rarely benefits the solution. The data model as described in this document can be indexed and have referential integrity applied, and each table has a clear and unambiguous meaning. Combining any of these tables loses the transparency of the solution and impacts its performance. It is, with some thought, possible to reduce the entire data model to less than ten tables; it is also then virtually impossible to understand the finished data model.

43 A Major Entity could have a type table, a band table, a property table with its own type table, an event table with its own type table, a link table with its own type table and a segment table with its own type table which is a total of ten tables excluding the major entity


Which Results to Store?

Results are the outcome of some processing performed within the operational system that creates information. A data warehouse will often be faced with the question of whether to store the results, or to store sufficient data to reproduce the algorithm and therefore regenerate the results. For example:

• In a bank, the interest calculation
• In a telephone company, the call rating calculation
• In an airline, the frequent flyer points
• In a manufacturing company, the sales person's commission

In general the purpose of a data warehouse is as a reporting system: a mirror of the information in the operational system, albeit in a different format. By design it should therefore take the results as calculated elsewhere and store them to allow reporting. It is also important to understand the complexity of the systems that generate the results. A typical telco billing process will handle billions of unrated call data records through complex algorithms in order to generate the rated call records and consequently the bill. Furthermore the billing systems allow rapid change in the rules used for billing so that the company can bring new products to market quickly. Given the engineering that has gone into building high performance billing systems and the amount of change in billing requirements, it is impractical to try to reliably recreate the billing process in the ETL process. It is therefore important to know what factors were used in the rating of an individual record (e.g. time bands, distance, number types, etc.) but not exactly how they were applied. The accurate storing of the results generated elsewhere is the objective of the data warehouse.

This approach can be extended to a general principle: data warehouses should store the results of batch processes in source systems rather than try to reproduce the algorithms that generated the result sets. This has consequences for data quality. Users in the example given above might perceive data quality issues if the sum of the rated calls does not equal the billed amount. There are two possible causes for this. The first is inaccurate ETL which, of course, is a data warehouse problem that has to be resolved. The second is an issue in the batch process in the source system. This second type of issue often goes undetected because users of the operational system look at individual bills, whilst in the data warehouse they are likely to analyse across multiple bills. There is also no simple remedy for this problem. If data is loaded and reconciled against the source system(s), differences will be found. It may not be possible for the data warehouse to resolve them all. Instead they must all be explained and the users of the system educated to understand how the differences between the source systems and the data marts come about. As a result the users may consider changing the source system or business process to get more accurate information.


The Holistic Approach

Using a process neutral data modelling technique enforces a holistic approach to developing the data warehouse solution. There are constant trade-offs between the efficiency of load, storage and query in the execution of the system. There are also trade-offs between the cost of bespoke development and re-development when compared to the use of convention over configuration techniques.

Even if the data models themselves are not used, there is much benefit from using the techniques as a method of analysis for the data modelling. Given that major entities have types, bands, properties, events, links and segments, it becomes much easier to ensure all the data that might be required has been analysed and discussed in the requirements stage.

Unlike most data modelling approaches, this method has a basis in the understanding of enterprise architecture, and therefore it is possible to tune the data model to get the optimal overall solution for the specific situation because the impact of changing one aspect (e.g. ETL loading) at the expense of another (e.g. data maintenance) can be clearly seen.

The holistic approach also requires technical discipline and good change control, as should any other method. The use of this approach often highlights the failure of an organisation in these areas, and means that sometimes organisations will choose other methods that do not directly expose these failures. Unfortunately hiding these failures does not mean there is no impact, just that the impact is hidden until it becomes critical and causes problems.


Summary

This white paper has looked at an example company and how the data models of operational systems within that company evolve. Using this example it has been possible to study the impact those changes have on the reporting and data warehousing solutions. To mitigate the impact of these changes the use of a process neutral data model has been examined. This method creates a data model that stores the core business data in a format that is abstracted from the current operational systems. The technique also takes advantage of the benefits of using convention over configuration to define standard format tables, and of lifetime value principles to implement the DRY or 'Don't Repeat Yourself' concept, making the data model easily understood. Of course there is no perfect solution to developing a data model, and so the implementation issues associated with this technique have also been examined. Combining the techniques described in this white paper allows a data model to be developed quickly, that is easily understood, and that will lower the total cost of ownership because it is not so susceptible to change.


Appendix 1 – Data Modelling Standards

The data modelling standards outlined below are the ones used by Data Management & Warehousing. Where a choice has been made (plural vs. singular for example) it is not important what the choice is, but it is important that a choice has been made and that it is documented and enforced.

General Conventions

All table and column names must use uppercase letters, the digits 0-9 and an underscore '_' to replace a space. No other characters are allowed. This is for database compatibility reasons.44

Table and column names must be no longer than 30 characters including underscores. This is also for database compatibility reasons.

Table names and column names should be in English. This is because, regardless of where in the world the system is operating, the majority of source systems will have English table names, and the amount of time lost trying to translate and match table and column names is significant when compared with visual inspection and quick comparison in the same language.45

Table Conventions

Table names are always plural; a table is a collection of zero, one or more rows.

Every table should have a short name or alias. The short name is created using the following rules:

Every short name is six characters long.

If a table name is less than six characters the short name is the table name right padded with 'Z' until it is six characters long, e.g. BILLS becomes BILLSZ.

If a table name is made up of one word of six or more characters then the short name is the first six characters, e.g. ACCOUNTS becomes ACCOUN.

If a table name is made up of two words then the first three characters of each word are used to create the short name, e.g. ACCOUNT_TRANSACTIONS becomes ACCTRA.

If a table name is made up of three words then the first two characters of each word are used to create the short name, e.g. CALL_DISTANCE_BAND becomes CADIBA.

If a table name is made up of four words then the first two characters of the first two words and the first character of the third and fourth words are used to make up the short name, e.g. CALL_DISTANCE_BAND_GROUPS becomes CADIBG.

44 Database Identifier Lengths Comparison https://test.kuali.org/confluence/display/KULRICE/Database+Table+and+Column+Name+Standards

45 Taking this further there are minor differences between UK and US English (e.g. COLOUR vs. COLOR) and therefore strictly speaking the data model should be in US English.


If a table name is made up of five words then the first two characters of the first word and the first character of the second, third, fourth and fifth words are used to make up the short name, e.g. THE_QUICK_BROWN_FOX_JUMPED becomes THQBFJ.

If a table name is made up of six or more words then the first character of each of the first six words is used to make up the short name, e.g. THE_QUICK_BROWN_FOX_JUMPED_OVER becomes TQBFJO.

If there are any conflicts as a result of this then they should be resolved and documented by the data modeller.

Table Suffixes

There are a series of table name suffixes that are reserved for specific functions; these are:

_TYPES [Alternate shorter name _TYP]

A _TYPES table provides a classification of the associated table data into discrete values. The singular form of the table that is being classified always prefixes the _TYPES (e.g. PARTIES is classified by PARTY_TYPES, PARTY_PROPERTIES is classified by PARTY_PROPERTY_TYPES, etc.). Where the table being classified has more than one classification, the attribute being classified is added between the table name and the _TYPES (e.g. PARTY_GENDER_TYPES classifies GENDER in the PARTIES table). A _TYPES table can be associated with any other table except for another _TYPES or a _BANDS table.

_BANDS (a _TYPES special case) [Alternate shorter name _BAN]

A _BANDS table provides a classification of the associated table data into a range of values. The singular form of the table that is being classified always prefixes the _BANDS. Where the table being classified has more than one classification, the attribute being classified is added between the table name and the _BANDS. A _BANDS table can be associated with any other table except for the following types: _TYPES, _BANDS, _PROPERTIES, _EVENTS, _LINKS, _SEGMENTS.

_PROPERTIES [Alternate shorter name _PRO]

A _PROPERTIES table provides time variant data storage support for non-lifetime value attributes of a major entity. The singular form of the table that is being supported always prefixes the _PROPERTIES (e.g. PARTIES is supported by PARTY_PROPERTIES, PRODUCTS is supported by PRODUCT_PROPERTIES, etc.). _PROPERTIES tables can only be associated with major entities and always have a related _TYPES table.

_EVENTS (a _PROPERTIES special case) [Alternate shorter name _EVE]

A _EVENTS table provides time variant data storage support for non-lifetime value attributes of a major entity that occur more than once but at a point in time rather than over a period of time (which is covered by _PROPERTIES). The singular form of the table that is being supported always prefixes the _EVENTS (e.g. PARTIES is supported by PARTY_EVENTS, PRODUCTS is supported by PRODUCT_EVENTS, etc.). _EVENTS tables can only be associated with one major entity table and always have a related _TYPES table.


_LINKS [Alternate shorter name _LIN]

A _LINKS table provides time-variant peer-to-peer relationship support between two records within the same major entity. The singular form of the table that is being supported always prefixes the _LINKS (e.g. PARTIES is supported by PARTY_LINKS, PRODUCTS is supported by PRODUCT_LINKS, etc.). _LINKS tables can only be associated with major entities and always have a related _TYPES table.

_SEGMENTS [Alternate shorter name _SEG]

A _SEGMENTS table provides time-variant peer group support for records within the same major entity. The singular form of the table that is being supported always prefixes the _SEGMENTS (e.g. PARTIES is supported by PARTY_SEGMENTS, PRODUCTS is supported by PRODUCT_SEGMENTS, etc.). _SEGMENTS tables can only be associated with major entities and always have a related _TYPES table.

_HISTORY [Alternate shorter name _HIS]

A _HISTORY table provides time-variant peer-to-peer relationship support between two records in different major entities. The singular forms of the two major entity tables that are being supported always prefix the _HISTORY (e.g. PARTY and GEOGRAPHY are supported by PARTY_GEOGRAPHY_HISTORY, etc.). _HISTORY tables can only be associated with two major entities and always have a related _TYPES table.

Column Conventions

Column names are always singular; a column is a single element within a row.

Stand-Alone Columns

There are a number of columns that are added to every table:

TIMESTAMP

A timestamp for each record is held. This is either the date and time that the row was created or, subsequently, the date and time at which it was last modified. If two systems update any part of a row within one load process, only the last modification is preserved and no count of modifications is maintained. The data type of a timestamp must be TIMESTAMP where supported by the database, or DATE otherwise.

ORIGIN

This is used to identify what made the last change to the record; it should be the name of the ETL process or mapping that performed the insert or last update. If two systems update any part of a row within one load process, only the last updating system is preserved and no count of modifications is maintained. It is important to note that the ORIGIN only reflects the last process in the chain to insert or update a record: a record may come from multiple source systems, passing through many ETL processes, before being inserted into the database. The ORIGIN is set to the last ETL process, and the ETL tool must then contain the audit trail back to the previous system, and so on. The data type and format of the ORIGIN column must be VARCHAR(32)46. This approach is known as tracking the data lineage.

46 This document uses ANSI SQL92 standards. Other databases may use other data types, e.g. Oracle would use VARCHAR2(32)
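As an illustrative sketch only (the PARTIES table, its columns and the mapping names here are assumed examples rather than definitions from this standard), an ETL process would set both columns on every insert or update:

    -- Hypothetical example: table, column and mapping names are illustrative.
    -- Note: TIMESTAMP may need quoting as a column name in some databases.
    INSERT INTO PARTIES (PARTY_DWK, PARTY_TYPE_DWK, CURRENT_SURNAME, TIMESTAMP, ORIGIN)
    VALUES (1001, 1, 'Smith', CURRENT_TIMESTAMP, 'M_CRM_TO_PARTIES_INS');

    UPDATE PARTIES
    SET    CURRENT_SURNAME = 'Jones',
           TIMESTAMP       = CURRENT_TIMESTAMP,
           ORIGIN          = 'M_CRM_TO_PARTIES_UPD'
    WHERE  PARTY_DWK = 1001;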


Column Suffixes

Standard extensions added to column names.

_DWK

The use of _DWK indicates a Data Warehouse Key: a key generated and maintained within the data warehouse, allowing the suffixes _ID, _CODE, _NUMBER, etc. to denote identifiers brought in from the source data. Every table must use a _DWK surrogate key rather than any source system key that may change when the source system is changed. All _DWK columns are of integer data type.

_TIME

Any field that has the suffix _TIME must contain a time. This information is stored in a TIME data type if available; otherwise it is stored in the DATE data type with the date component set to ‘01-JAN-1900’. This is to allow arithmetic to be performed on time fields.

_DATE

Any field that has the suffix _DATE must contain information stored in the DATE data type.

_START_DATE

The _START_DATE can have two types of value:

Value Type                           Meaning
Start Date before the current date   An event that has actually happened
Start Date after the current date    An event that is certain to happen at a point in the future

Figure 24 - _START_DATE Rules

It should be noted that some _START_DATE values are impossible to obtain, and therefore, whilst not strictly compliant with the definition, a NULL might have to be allowed to represent unknown data. The alternative is to enter a default value, but this is to be avoided as it may bias aggregate results.

_END_DATE

The _END_DATE can have three types of value:

Value Type                          Meaning
Null                                A status with no planned change of status
End Date before the current date    An event that has actually happened
End Date after the current date     An event that is planned to happen

Figure 25 - _END_DATE Rules

_EVENT_DATE

A _EVENT_DATE is always found in a _EVENTS table instead of a _START_DATE and a _END_DATE. It represents the date on which the event took place.


_DESC

Description fields are free text fields that describe the record. They should not be relied on for queries; keys and appropriate joins should be used instead. The standard data type and format for a description is VARCHAR(255).

_NUMERIC_VALUE

Holds a floating point number for use in _PROPERTIES, _EVENTS, _LINKS, _SEGMENTS and _HISTORY tables.

_TEXT_VALUE

Holds VARCHAR(255) text for use in _PROPERTIES, _EVENTS, _LINKS, _SEGMENTS and _HISTORY tables.

Column Data Types and Sizes

• Short text columns should be VARCHAR(32)
• Long text columns should be VARCHAR(255)
• Numbers should be INTEGER unless specifically required to be otherwise
• Dates should have a data type of DATE
• Times should have a data type of TIME where supported, otherwise DATE
• MEMO, LONG, BLOB and CLOB should be avoided at all costs

Column Prefixes

There are also a number of standard column prefixes.

STANDARDIZED_

Standardized fields are fields that have been cleaned in some way.

CURRENT_

The CURRENT_ prefix denotes a current value that might not have lifetime value in a major entity, such as SURNAME in the PARTIES table.

PREVIOUS_

The PREVIOUS_ prefix denotes the value of a field held in a CURRENT_ field prior to the last update. No further history is kept of this value; it is always the value before the one held in the CURRENT_ field.

LINKED_

Used where two foreign keys from the same table are used in a _LINKS table.
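A minimal sketch of maintaining the CURRENT_/PREVIOUS_ pair during an update, assuming a PARTIES table with CURRENT_SURNAME and PREVIOUS_SURNAME columns (illustrative names; in standard SQL the right-hand side of each SET reads the pre-update value, so PREVIOUS_SURNAME captures the old surname):

    UPDATE PARTIES
    SET    PREVIOUS_SURNAME = CURRENT_SURNAME,  -- keep exactly one prior value
           CURRENT_SURNAME  = 'Jones',
           TIMESTAMP        = CURRENT_TIMESTAMP,
           ORIGIN           = 'M_CRM_TO_PARTIES_UPD'
    WHERE  PARTY_DWK = 1001
    AND    CURRENT_SURNAME <> 'Jones';          -- only when the value has changed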

Column Null / Not Null

All columns should be NOT NULL unless otherwise specified (e.g. in _END_DATE and _VALUE columns)


Column Name Abbreviations

Due to the 30-character limit, column names occasionally have to be abbreviated. The following abbreviations are acceptable for suffixes:

Abbreviation  Long Description
_B            _BAND
_BSV          _BAND_START_VALUE
_BEV          _BAND_END_VALUE
_BD           _BAND_DESC
_BG           _BAND_GROUP
_BSD          _BAND_START_DATE
_BED          _BAND_END_DATE
_T            _TYPE
_TD           _TYPE_DESC
_TG           _TYPE_GROUP
_TSD          _TYPE_START_DATE
_TED          _TYPE_END_DATE
_PSD          _PROPERTY_START_DATE
_PED          _PROPERTY_END_DATE
_PNV          _PROPERTY_NUMERIC_VALUE
_PTV          _PROPERTY_TEXT_VALUE
_PT           _PROPERTY_TYPE
_PTD          _PROPERTY_TYPE_DESC
_PTG          _PROPERTY_TYPE_GROUP
_PTSD         _PROPERTY_TYPE_START_DATE
_PTED         _PROPERTY_TYPE_END_DATE
_ED           _EVENT_DATE
_ENV          _EVENT_NUMERIC_VALUE
_ETV          _EVENT_TEXT_VALUE
_ET           _EVENT_TYPE
_ETD          _EVENT_TYPE_DESC
_ETG          _EVENT_TYPE_GROUP
_ETSD         _EVENT_TYPE_START_DATE
_ETED         _EVENT_TYPE_END_DATE
_LSD          _LINK_START_DATE
_LED          _LINK_END_DATE
_LNV          _LINK_NUMERIC_VALUE
_LTV          _LINK_TEXT_VALUE
_LT           _LINK_TYPE
_LTD          _LINK_TYPE_DESC
_LTG          _LINK_TYPE_GROUP
_LTSD         _LINK_TYPE_START_DATE
_LTED         _LINK_TYPE_END_DATE
_SSD          _SEGMENT_START_DATE
_SED          _SEGMENT_END_DATE
_SNV          _SEGMENT_NUMERIC_VALUE
_STV          _SEGMENT_TEXT_VALUE
_ST           _SEGMENT_TYPE
_STD          _SEGMENT_TYPE_DESC
_STG          _SEGMENT_TYPE_GROUP
_STSD         _SEGMENT_TYPE_START_DATE
_STED         _SEGMENT_TYPE_END_DATE
_HSD          _HISTORY_START_DATE
_HED          _HISTORY_END_DATE
_HNV          _HISTORY_NUMERIC_VALUE
_HTV          _HISTORY_TEXT_VALUE
_HT           _HISTORY_TYPE
_HTD          _HISTORY_TYPE_DESC
_HTG          _HISTORY_TYPE_GROUP
_HTSD         _HISTORY_TYPE_START_DATE
_HTED         _HISTORY_TYPE_END_DATE

Figure 26 - Column Name Abbreviations


The following abbreviations are acceptable for prefixes:

Abbreviation  Long Description
CUR_          CURRENT_
LIN_          LINKED_
PRE_          PREVIOUS_
STA_          STANDARDIZED_

For large projects and models it is sometimes useful to consider using all abbreviations from the outset.

Index Conventions

Where databases use administrator-defined indexes, the following conventions should be used.

Primary Key Index

Primary Key indexes should be named PK_XXXXXX where XXXXXX is the six-character short table name

Foreign Key Index

Foreign Key indexes should be named FK_XXXXXX_YYYYYY_N where XXXXXX is the six-character short table name of the table with the primary key, YYYYYY is the six-character short table name of the table with the foreign key and N represents a sequence number between 1 and 9 for the index

Unique Key Index

Unique Key indexes should be named UK_XXXXXX_N where XXXXXX is the six-character short table name of the table being indexed and N represents a sequence number between 1 and 9 for the index

Non-Unique Key Index

Non-unique Key indexes should be named NK_XXXXXX_N where XXXXXX is the six-character short table name of the table being indexed and N represents a sequence number between 1 and 9 for the index. Note that if more than nine indexes are needed then there is something wrong.
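For example, assuming PARTIES shortens to PARTIE (first six characters of a one-word name) and PARTY_PROPERTIES to PARPRO under the short-name rules, the index names would look like this; the column choices are illustrative:

    CREATE UNIQUE INDEX PK_PARTIE ON PARTIES (PARTY_DWK);            -- primary key index
    CREATE INDEX FK_PARTIE_PARPRO_1 ON PARTY_PROPERTIES (PARTY_DWK); -- foreign key index
    CREATE UNIQUE INDEX UK_PARTIE_1 ON PARTIES (PARTY_ID);           -- unique key index
    CREATE INDEX NK_PARTIE_1 ON PARTIES (CURRENT_SURNAME);           -- non-unique key index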

Standard Table Constructs

The following provide the standard table definitions for each table construct.

_TYPES

Column            Data Type  Length  Optional
_TYPE_DWK         INTEGER            NOT NULL
_TYPE             VARCHAR    32      NOT NULL
_TYPE_DESC        VARCHAR    255     NOT NULL
_TYPE_GROUP       VARCHAR    32      NOT NULL
_TYPE_START_DATE  DATE               NOT NULL
_TYPE_END_DATE    DATE               NULL
TIMESTAMP         DATE               NOT NULL
ORIGIN            VARCHAR    32      NOT NULL

Figure 27 - Standard _TYPES table
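A sketch of the corresponding DDL for a concrete instance, PARTY_TYPES, assuming the generic column names are prefixed with the singular entity name (as the abbreviation rules imply) and a two-word short name of PARTYP:

    -- Illustrative ANSI-style DDL; not a definitive implementation
    CREATE TABLE PARTY_TYPES (
        PARTY_TYPE_DWK        INTEGER      NOT NULL,
        PARTY_TYPE            VARCHAR(32)  NOT NULL,
        PARTY_TYPE_DESC       VARCHAR(255) NOT NULL,
        PARTY_TYPE_GROUP      VARCHAR(32)  NOT NULL,
        PARTY_TYPE_START_DATE DATE         NOT NULL,
        PARTY_TYPE_END_DATE   DATE,                  -- NULL: no planned change
        TIMESTAMP             DATE         NOT NULL, -- TIMESTAMP type where supported
        ORIGIN                VARCHAR(32)  NOT NULL,
        CONSTRAINT PK_PARTYP PRIMARY KEY (PARTY_TYPE_DWK)
    );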


_BANDS

Column             Data Type  Length  Optional
_BAND_DWK          INTEGER            NOT NULL
_BAND              VARCHAR    32      NOT NULL
_BAND_START_VALUE  NUMBER             NOT NULL
_BAND_END_VALUE    NUMBER             NULL
_BAND_DESC         VARCHAR    255     NOT NULL
_BAND_GROUP        VARCHAR    32      NOT NULL
_BAND_START_DATE   DATE               NOT NULL
_BAND_END_DATE     DATE               NULL
TIMESTAMP          DATE               NOT NULL
ORIGIN             VARCHAR    32      NOT NULL

Figure 28 - Standard _BANDS table

_PROPERTIES

Column                   Data Type  Length  Optional
_DWK                     INTEGER            NOT NULL
_PROPERTY_TYPE_DWK       INTEGER            NOT NULL
_PROPERTY_TEXT_VALUE     VARCHAR    32      NULL
_PROPERTY_NUMERIC_VALUE  NUMBER             NULL
_PROPERTY_START_DATE     DATE               NOT NULL
_PROPERTY_END_DATE       DATE               NULL
TIMESTAMP                DATE               NOT NULL
ORIGIN                   VARCHAR    32      NOT NULL

Figure 29 - Standard _PROPERTIES table
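A sketch of how a property value current on a given date would be retrieved, assuming PARTY_PROPERTIES and PARTY_PROPERTY_TYPES built to these constructs; the column names follow the prefixing convention and are illustrative:

    SELECT pp.PARTY_DWK,
           ppt.PARTY_PROPERTY_TYPE,
           pp.PARTY_PROPERTY_TEXT_VALUE,
           pp.PARTY_PROPERTY_NUMERIC_VALUE
    FROM   PARTY_PROPERTIES pp
    JOIN   PARTY_PROPERTY_TYPES ppt
      ON   ppt.PARTY_PROPERTY_TYPE_DWK = pp.PARTY_PROPERTY_TYPE_DWK
    WHERE  pp.PARTY_PROPERTY_START_DATE <= DATE '2009-01-01'
    AND    (pp.PARTY_PROPERTY_END_DATE IS NULL
            OR pp.PARTY_PROPERTY_END_DATE > DATE '2009-01-01');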

_EVENTS

Column                Data Type  Length  Optional
_DWK                  INTEGER            NOT NULL
_EVENT_TYPE_DWK       INTEGER            NOT NULL
_EVENT_TEXT_VALUE     VARCHAR    32      NULL
_EVENT_NUMERIC_VALUE  NUMBER             NULL
_EVENT_DATE           DATE               NOT NULL
TIMESTAMP             DATE               NOT NULL
ORIGIN                VARCHAR    32      NOT NULL

Figure 30 - Standard _EVENTS table

_LINKS

Column               Data Type  Length  Optional
_DWK                 INTEGER            NOT NULL
LINKED_ _DWK         INTEGER            NOT NULL
_LINK_TYPE_DWK       INTEGER            NOT NULL
_LINK_TEXT_VALUE     VARCHAR    32      NULL
_LINK_NUMERIC_VALUE  NUMBER             NULL
_LINK_START_DATE     DATE               NOT NULL
_LINK_END_DATE       DATE               NULL
TIMESTAMP            DATE               NOT NULL
ORIGIN               VARCHAR    32      NOT NULL

Figure 31 - Standard _LINKS table
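Because a _LINKS table relates two rows of the same major entity, queries join back to that entity twice; a sketch assuming PARTY_LINKS and illustrative column names:

    SELECT p1.PARTY_DWK  AS PARTY_DWK,
           p2.PARTY_DWK  AS LINKED_PARTY_DWK,
           plt.PARTY_LINK_TYPE
    FROM   PARTY_LINKS pl
    JOIN   PARTIES p1 ON p1.PARTY_DWK = pl.PARTY_DWK
    JOIN   PARTIES p2 ON p2.PARTY_DWK = pl.LINKED_PARTY_DWK
    JOIN   PARTY_LINK_TYPES plt
      ON   plt.PARTY_LINK_TYPE_DWK = pl.PARTY_LINK_TYPE_DWK
    WHERE  pl.PARTY_LINK_END_DATE IS NULL;  -- links currently in force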


_SEGMENTS

Column                  Data Type  Length  Optional
_DWK                    INTEGER            NOT NULL
_SEGMENT_TYPE_DWK       INTEGER            NOT NULL
_SEGMENT_TEXT_VALUE     VARCHAR    32      NULL
_SEGMENT_NUMERIC_VALUE  NUMBER             NULL
_SEGMENT_START_DATE     DATE               NOT NULL
_SEGMENT_END_DATE       DATE               NULL
TIMESTAMP               DATE               NOT NULL
ORIGIN                  VARCHAR    32      NOT NULL

Figure 32 - Standard _SEGMENTS table

_HISTORY

Column                  Data Type  Length  Optional
_DWK (first entity)     INTEGER            NOT NULL
_DWK (second entity)    INTEGER            NOT NULL
_HISTORY_TYPE_DWK       INTEGER            NOT NULL
_HISTORY_TEXT_VALUE     VARCHAR    32      NULL
_HISTORY_NUMERIC_VALUE  NUMBER             NULL
_HISTORY_START_DATE     DATE               NOT NULL
_HISTORY_END_DATE       DATE               NULL
TIMESTAMP               DATE               NOT NULL
ORIGIN                  VARCHAR    32      NOT NULL

Figure 33 - Standard _HISTORY table

Sequence Numbers For Primary Keys

• The major entities require a sequence number to populate the _DWK field. Each major entity should have its own sequence.

• The _TYPES and _BANDS tables all need a sequence to populate their _DWK field. This should be a single sequence that is shared amongst all _TYPES and _BANDS tables. This has two effects: it prevents a larger number of sequences being created than necessary, and it also means that reference data cannot inadvertently be joined to other reference data.

• _PROPERTIES, _EVENTS, _SEGMENTS and _HISTORY tables do not need a primary key.

• Occurrence or transaction tables do not normally need a primary key. If they do then, like major entities, each one should have its own sequence.

• It is often asked whether the CALENDAR table should have a _DWK column or whether the date is sufficient. Either approach will work; however, for consistency the use of a _DWK is preferred.47

47 Some organisations compromise by using the Julian Day Number, i.e. the integer part of the date (see http://en.wikipedia.org/wiki/Julian_day), as a surrogate key that obscures the underlying information from the users but aids development. This does, from time to time, risk inconsistencies.
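A sketch of the sequence strategy described above (CREATE SEQUENCE and NEXT VALUE FOR follow ANSI SQL:2003; Oracle, for example, would use PARTIES_SEQ.NEXTVAL; all names and values are illustrative):

    -- One sequence per major entity
    CREATE SEQUENCE PARTIES_SEQ;
    CREATE SEQUENCE PRODUCTS_SEQ;

    -- A single shared sequence for all _TYPES and _BANDS tables
    CREATE SEQUENCE REFERENCE_SEQ;

    INSERT INTO PARTY_TYPES
        (PARTY_TYPE_DWK, PARTY_TYPE, PARTY_TYPE_DESC, PARTY_TYPE_GROUP,
         PARTY_TYPE_START_DATE, TIMESTAMP, ORIGIN)
    VALUES
        (NEXT VALUE FOR REFERENCE_SEQ, 'PERSON', 'An individual person',
         'LEGAL ENTITY', DATE '2009-01-01', CURRENT_DATE, 'M_REF_LOAD');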


Appendix 2 – Understanding Hierarchies

Hierarchies are an essential part of any reporting system and yet there are two common mistakes that regularly affect their implementation.

Sales Regions

Most businesses will have a ‘geographic’ structure of some type, sales region being a prime example. The title ‘sales region’ and the names of the elements (e.g. country names, state names, city names) add to the confusion in implying that this is a geographic hierarchy, but this is wrong; it is an organisational hierarchy.

Whilst in concept the business allocates resources to cover different geographic regions, the practicalities of running the business soon overtake the situation. A series of exceptions are soon created, with some accounts being looked after by people out of region and some accounts being looked after by non-geographic functions. There is no direct geographical relationship, just a use of geographic names for familiarity.

It is possible to associate the organisational structure via a history table to the addresses of clients, but this is of little value. The requirement should be to accurately report the hierarchy and not to become the system of record for how the sales teams are organised. It is important to remember that management teams will subjectively change this structure as their resources permit.

Internal Organisation Structure

The second common mistake is to treat roles and individuals within an organisation structure as being synonymous. Below is a typical organisation chart:

Figure 34 - Typical Organisation Hierarchies

This is often stored as Jack Doe reporting to John Smith, etc. However, the people in the organisation structure are not the hierarchy. What needs to be stored is the role as an organisational unit:

Figure 35 - Stored Organisational Hierarchy


The role hierarchy is significantly less dynamic than the people within it, and organisation changes are much more controlled, as the business chooses when to re-structure but does not choose when staff join or leave. Most large organisations will have a personnel or human resources department that manages the organisation hierarchy, and if they use a Human Resource Management System it is likely that every role will have a unique ID and a documented position in the hierarchy.

It is then possible to relate the roles as organisational units to the individuals so:

Individual   Type       Organisational Unit
John Smith   Works as   Sales Manager
Jack Doe     Works as   Sales Executive 1
etc.

Figure 36 - Relating Individuals to Roles

This has the added advantage of dealing with temporary resources and also with the transition of resources (e.g. when someone is moving from one team to another and fulfils two roles for a short period of time).
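One hedged sketch of how this could be held in the model described in the main text: treat each role (organisational unit) as a PARTIES row of its own type, and relate individuals to roles through PARTY_LINKS with a 'WORKS AS' link type. All keys, names and values below are invented for illustration:

    INSERT INTO PARTY_LINKS
        (PARTY_DWK,             -- 2001 = John Smith (individual)
         LINKED_PARTY_DWK,      -- 3001 = Sales Manager (organisational unit)
         PARTY_LINK_TYPE_DWK,   -- 41   = 'WORKS AS'
         PARTY_LINK_START_DATE, TIMESTAMP, ORIGIN)
    VALUES
        (2001, 3001, 41, DATE '2009-01-01', CURRENT_DATE, 'M_HR_LOAD');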


Appendix 3 – Industry Standard Data Models

A number of organisations offer industry standard data models. These can be broken down into two types of provider:

• Vendors such as IBM, Oracle, Sybase and Teradata, who all provide some form of standard data model for some industry sectors. These models usually started from a client project, followed by a period of internal refinement, before becoming ‘productized’

• Industry organisations such as TMForum in the telecommunications industry, which have decided that there is value in building an industry-wide common data model

Both types of provider usually offer logical models, and these can be used in one of two ways:

• As a reference data model used for the accumulated industry knowledge
• As a real implementation data model

A logical data model will need to be converted to a physical model, and it is in this conversion process, when the physical data model is created, that process neutral data modelling can be used.

As an example, one of the best described and most readily available data models is the Information Framework (SID)48 from TMForum.org49, an industry association focused on transforming business processes, operations and systems for managing and monetizing on-line Information, Communications and Entertainment services. The Information Framework provides the foundation of a "common language" that allows common representation, as well as a standardized meaning for the relationships that exist among logical entities: for example, a common definition of what a "customer" is and how it relates to other elements, such as mailing address, purchase order, billing records, trouble tickets, and so on. This is an ideal basis for a process neutral data model as there is a defined set of major entities with lifetime values and relationships.

The model is broken down into a number of domains:

• Market / Sales
• Product
• Customer
• Service
• Resources
• Supplier / Partner
• Common Business
• Enterprise

Within these there are a number of subject areas.

48 TMForum Information Framework (SID) http://www.tmforum.org/InformationFramework/1684/home.html 49 TMForum: http://www.tmforum.org/browse.aspx


Figure 37 - TMForum Information Framework (SID) Version 8.0 overview50

At first glance this appears to offer a view of the world incompatible with the design objectives of a process neutral data model (e.g. because there is a customer and a supplier rather than just a party). This is an incorrect assumption: there are two ways in which the Information Framework can be used with the approach described in this document.

The first method is to trust the Information Framework and implement it as a process neutral data model. Therefore there is no major entity called Party; instead there is one called Customer and one called Supplier. This approach trusts the reference model to have thought through the industry-specific lifetime value issues and to be fit for purpose. In the specific case of the TMForum Information Framework this is a safe assumption, as it is widely peer reviewed and widely used by industry experts. This is, however, not always true of all vendor data models.

The second method is to use the Information Framework as a point of reference and create a process neutral data model that meets all the described entities and attributes of the Information Framework. In this case a Party entity would exist, and the attributes and associated properties, links, etc. would be validated to ensure that all information held in the Information Framework could be stored in the resulting data model.

Both of these approaches have been successfully used with the TMForum Information Framework and could be used with other industry standard data models. The choice of approach will often depend on the quality of the reference data model, the likelihood of change and the needs of the business.

50 The SID model is copyright TMForum.org and was taken from http://www.tmforum.org/sdata/content/PracticesStandards/sid/default.aspx


Appendix 4 – Information Sparsity

Once all the processing is complete and data quality issues resolved, how much information does a data warehouse actually have? The answer is inevitably less than the organisation believes.

The rise of social networking sites has helped to develop understanding in this area. It shows that most people (80%) have fewer than 100 friends51 and that people do not know as much as they think about those friends. One simple test is to assume that you have 100 friends and assess for what percentage of your friends you know the following information:

• First Name & Last Name
• Middle Names
• Birthday (e.g. 6 July)
• Date of Birth (e.g. 6 July 1966)
• Partner's Name
• Home Address
• Home Telephone Number
• Home e-Mail Address
• Work Address
• Work Telephone Number
• Work e-Mail Address
• Mobile Number
• All of the above

The chances are that you will not know the answer for 100% of your friends to any question. (What is Mrs Smith's first name, she lives two doors down and looks after the cat when you are away?)

The situation is also one that deteriorates rapidly. As a result of reading this white paper you might decide to contact your 100 best friends and get all the above information. Your friends are tolerant of your request and provide you with all this information. In six months' time you decide to update your address book and you contact your tolerant friends again to check that all the details are still correct. The chances are that at least twenty percent of your friends will have changed some part of the information over the six months.52

The use of synchronisation tools, social networks and personal address book sites53 has improved the automation of change notification. It is now possible to update your own details on a service and for that to automatically update the records of your friends who also use the service, but the change rate is still high.

51 From a survey by RapLeaf http://www.marketingvox.com/more-women-than-men-on-social-networks-have-more-friends-than-men-do-038384/ 52 The percentage is variable with age and socio-economic factors. Middle age and high incomes improve stability and reduce the percentage of change. Youth and older age, as well as lower incomes, increase the amount of change. This was dramatically demonstrated in the UK with the introduction of the community charge or “poll tax” (http://en.wikipedia.org/wiki/Community_Charge) between 1990 and 1993. Local authorities were responsible for collecting the basic household information and struggled to maintain an accurate list of households. Whilst there was quite a lot of deliberate avoidance that cannot be factored in, there was also regularly 20% of notified change in any single month. 53 Sites include Facebook, Bebo, Plaxo, LinkedIn and Naymz. Many of these sites are now adding features that allow you to better qualify and quantify these friends into true friends, acquaintances, etc.

Page 58: White Paper - Process Neutral Data Modelling

White Paper - Process Neutral Data Modelling

© 2009 Data Management & Warehousing

Page 58

This issue transfers into the data warehouse environment. If an individual cannot keep track of their friends, then how does a business keep track of its customers? Businesses only get informed of changes when the customer requires something. For example, if you register a ‘pay as you go’ mobile telephone and are required to provide an address when doing so, do you bother to update the operator's records when you change address? However, if you later need something sent from the telephone company then you will contact them to update their records.

One Telco data warehouse team attempted to measure how poor the address data was. They decided to look at post-paid customers, who receive a printed bill each month. The method was obvious and simple once it was identified: they went to the mailroom and asked how many bills were returned by the postal service. The answer was about 25,000 per month, or 1% of the bills generated. They also discovered that a team handled the returned mail and by various methods updated the addresses in the main billing system. Therefore each month 1% of the post-paid customer data expired. Pre-paid customers, the much larger proportion of the total customer base, would have much less reason to update their information and therefore a much larger percentage of expired data.

The process neutral data model aids this situation in two simple ways:

• The first way in which it helps is that there is a separate record for each piece of information (home address is stored separately from work address within PARTY_ADDRESS_HISTORY, etc.), and this means that it is easy to maintain each of the different pieces of information without impacting other pieces of information.

• The second way in which it helps is that each piece of information has its own TIMESTAMP. It is therefore possible to exclude information based on the age of the information, as sketched below.
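As a sketch of the second point, stale information can simply be filtered on the age of its TIMESTAMP; the one-year cut-off is an arbitrary illustration and interval syntax varies by database:

    SELECT *
    FROM   PARTY_ADDRESS_HISTORY
    WHERE  TIMESTAMP >= CURRENT_DATE - INTERVAL '1' YEAR;  -- ignore data older than a year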


Appendix 5 – Set Processing Techniques

This appendix outlines the basic principles involved in writing set-based change capture; it is not comprehensive but provides the basic flow used in the technique. The technique assumes that the database supports set operators and that there is a staging area where a previous copy of the table can be held (on the first day the previous copy exists but is empty). It is not an exhaustive description. Change capture techniques such as this offer some of the biggest opportunities to improve data warehouse load times.

In this example the table is called TABLE:

1. PREVIOUS_TABLE exists from previous run of the process or is created empty for the very first run

2. A copy of the source system table is taken to create CURRENT_TABLE

3. The INS_UPD_TABLE is created as CURRENT_TABLE MINUS PREVIOUS_TABLE

4. The DEL_UPD_TABLE is created as PREVIOUS_TABLE MINUS CURRENT_TABLE

5. The INS_TABLE is created as INS_UPD_TABLE MINUS DEL_UPD_TABLE

6. The DEL_TABLE is created as DEL_UPD_TABLE MINUS INS_UPD_TABLE

7. The UPD_TABLE is created as DEL_UPD_TABLE INTERSECT INS_UPD_TABLE

8. The PREVIOUS_TABLE is dropped and the CURRENT_TABLE renamed to PREVIOUS_TABLE

The three output tables are then applied to the warehouse: rows in INS_TABLE are inserted as new records into the appropriate tables; rows in DEL_TABLE update the end-date on existing records in the appropriate tables; and rows in UPD_TABLE are processed to end-date existing records and create new records in the appropriate tables.

Figure 38 - Set Processing Technique (diagram: PREVIOUS_TABLE and CURRENT_TABLE feed INS_UPD_TABLE and DEL_UPD_TABLE, which in turn produce INS_TABLE, UPD_TABLE and DEL_TABLE)
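A sketch of the steps in ANSI-style SQL. EXCEPT is the ANSI spelling of MINUS, CREATE TABLE ... AS SELECT syntax varies by database, and KEY_COLUMN is an assumed natural key, since in practice steps 5 to 7 compare keys rather than whole rows (a full-row INTERSECT of changed rows would be empty):

    -- Step 3: rows that are new or have changed
    CREATE TABLE INS_UPD_TABLE AS
    SELECT * FROM CURRENT_TABLE
    EXCEPT
    SELECT * FROM PREVIOUS_TABLE;

    -- Step 4: rows that have disappeared or have changed
    CREATE TABLE DEL_UPD_TABLE AS
    SELECT * FROM PREVIOUS_TABLE
    EXCEPT
    SELECT * FROM CURRENT_TABLE;

    -- Step 5: pure inserts (keys present only on the insert/update side)
    CREATE TABLE INS_TABLE AS
    SELECT * FROM INS_UPD_TABLE
    WHERE  KEY_COLUMN NOT IN (SELECT KEY_COLUMN FROM DEL_UPD_TABLE);

    -- Step 6: pure deletes (keys present only on the delete/update side)
    CREATE TABLE DEL_TABLE AS
    SELECT * FROM DEL_UPD_TABLE
    WHERE  KEY_COLUMN NOT IN (SELECT KEY_COLUMN FROM INS_UPD_TABLE);

    -- Step 7: updates (keys present on both sides)
    CREATE TABLE UPD_TABLE AS
    SELECT * FROM INS_UPD_TABLE
    WHERE  KEY_COLUMN IN (SELECT KEY_COLUMN FROM DEL_UPD_TABLE);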


Appendix 6 – Standing on the shoulders of giants

"Bernard of Chartres used to say that we are like dwarfs on the shoulders of giants, so that we can see more than they, and things at a greater distance, not by virtue of any sharpness of sight on our part, or any physical distinction, but because we are carried high and raised up by their giant size."54

Process Neutral Data Modelling may seem a large leap from more widely discussed methods of data modelling for a data warehouse, but it has been used for over fifteen years by some of the largest organisations in the world. The techniques in this document have been influenced by a number of people:

• Ralph Kimball Creator of the data mart concept and proponent of the need to deliver simple, easy-to-use information to business users

• Bill Inmon Known as the father of data warehousing whose approach required a normalised database in which to store the lowest level of information

• Paul Winder (formerly of Oracle Corp) Creator of Oracle's Telco Reference Data Model and responsible for the abstraction of major entities within that model to allow it to be used across many different Telcos

• Ward Cunningham Owner of c2.com (home of the Portland Pattern Repository) and signatory of the agile manifesto. Ward also worked with a number of the author’s contemporaries at Sequent Computers in the early and mid 1990s.

• David Heinemeier Hansson Ruby on Rails designer who conceived and implemented many of the concepts used in “convention over configuration” from a coding perspective.

• Andy Hunt and Dave Thomas Authors of the book The Pragmatic Programmer in which DRY (Don’t Repeat Yourself) is a core principle.

And many others who will no doubt feel that they should have been included and to whom the author can only apologise for their omission.

54 This quote, often attributed to Isaac Newton, is by John of Salisbury, from his 1159 Metalogicon. http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants


Further Reading

Data Management & Warehousing have published a number of white papers on data warehousing and related issues. The following papers are available for download from http://www.datamgmt.com

Overview Architecture for Enterprise Data Warehouses

This is the first of a series of papers published by Data Management & Warehousing to look at the implementation of Enterprise Data Warehouse solutions in large organisations using a design pattern approach. A design pattern provides a generic approach, rather than a specific solution. It describes the steps that architecture, design and build teams will have to go through in order to implement a data warehouse successfully within their business.

This particular document looks at what an organisation will need in order to build and operate an enterprise data warehouse in terms of the following:

* The framework architecture: what components are needed to build a data warehouse, and how do they fit together
* The toolsets: what types of products and skills will be used to develop a system
* The documentation: how do you capture requirements, perform analysis and track changes in scope of a typical data warehouse project

This document is, however, an overview and therefore subsequent documents deal with specific issues in detail.

Data Warehouse Governance

An organisation that is embarking on a data warehousing project is undertaking a long-term development and maintenance programme of a computer system. This system will be critical to the organisation and cost a significant amount of money; therefore control of the system is vital. Governance defines the model the organisation will use to ensure optimal use and re-use of the data warehouse and enforcement of corporate policies (e.g. business design, technical design, and application security), and ultimately to derive value for money.

This paper has identified five sources of change to the system and the aspects of the system that these sources of change will influence in order to assist the organisation to develop standards and structures to support the development and maintenance of the solution. These standards and structures must then evolve, as the programme develops to meet its changing needs.

“Documentation is not understanding, process is not discipline, formality is not skill”

The best governance must only be an aid to the development and not an end in itself. Data Warehouses are successful because of good understanding, discipline and the skill of those involved. On the other hand systems built to a template without understanding, discipline and skill will inevitably deliver a system that fails to meet the users’ needs and sooner rather than later will be left on the shelf, or maintained at a very high cost but with little real use.


Data Warehouse Project Management

Data warehouse projects pose a specific set of challenges for the project manager. Whilst most IT projects are developments to support a well-defined pattern of work, a data warehouse is, by design, there to support users asking ad hoc questions of the data available to the business. It is also a project that will have more interfaces and more change than any other system within the organisation. Projects often have poorly set expectations in terms of timescales, the likely return on investment, the vendors' promises for tools, or the expectations set between the business and IT within an organisation. They also have large technical architectures and resourcing issues that need to be handled.

This document will outline the building blocks of good project control, including the definition of phases, milestones, activities, tasks, issues, enhancements, test cases, defects and risks, and will discuss how they can be managed and when, using an event horizon, the project manager can expect to get information. To help manage these building blocks this paper will look at the types of tools and technology that are available and how they can be used to assist the project manager. It also looks at how these tools fit into methodologies.

The final section of the paper looks at how effective project leadership and estimating can improve the chances of success for a project. This includes understanding the roles of the executive sponsor, project manager, technical architect and senior business analyst, along with the use of different leadership styles, organisational learning and team rotation.

Data Warehouse Documentation Roadmap

All projects need documentation and many companies provide templates as part of a methodology. This document describes the templates, tools and source documents used by Data Management & Warehousing. It serves two purposes:

• For projects using other methodologies or creating their own set of documents, to use as a checklist. This allows the project to ensure that the documentation covers the essential areas for describing the data warehouse.
• To demonstrate our approach to our clients by describing the templates and deliverables that are produced.

Documentation, methodologies and templates are inherently both incomplete and flexible. Projects may wish to add, change, remove or ignore any part of any document. Some may also believe that aspects of one document would sit better in another. If this is the case then users of this document and these templates are encouraged to change them to fit their needs.

Data Management & Warehousing believes that the approach or methodology for building a data warehouse should be to use a series of guides and checklists. This ensures that small teams of relatively skilled resources developing the system can cover all aspects of the project whilst being free to deal with the specific issues of their environment to deliver exceptional solutions, rather than a rigid methodology that ensures that large teams of relatively unskilled staff can meet a minimum standard.


How Data Works

Every business believes that their data is unique. However the storage and management of that data uses similar methods and technologies across all organisations. As a result the same issues of consistency, performance and quality occur across all organisations. The commercial difference between organisations is not whether they have data issues but how they react to them in order to improve the data.

This paper examines how data is structured and then examines characteristics such as the data model depth, the data volumes and the data complexity. Using these characteristics it is possible to look at the effects on the development of reporting structures, the types of data models used in data warehouses, the design and build of interfaces (especially ETL for data warehouses), data quality and query performance. Once the effects are understood it is possible for programmes and projects to reduce (but never remove) the impact of these characteristics resulting in cost savings for the business.

This paper also introduces concepts created by Data Management & Warehousing including:

• Left to right entity diagrams
• Data Model Depth
• Natural Star Schemas
• The Data Volume and Complexity graph
• Incremental Phase Benefit Model


List of Figures

Figure 1 - Initial Operational System Data Model
Figure 2 - Initial Reporting System Data Model
Figure 3 - Second Version Operational System Data Model
Figure 4 - The Sales Funnel
Figure 5 - Example data for PARTY_TYPES
Figure 6 - Example Data for GEOGRAPHY_TYPES
Figure 7 - Example data for TIME_BANDS
Figure 8 - Party Properties Example
Figure 9 - Example Party Property Data
Figure 10 - Example data for PARTY_PROPERTIES
Figure 11 - Example Data for PARTY_PROPERTY_TYPES
Figure 12 - Example Data for PARTY_PROPERTIES
Figure 13 - Example Data for PARTY_PROPERTIES
Figure 14 - Party Events Example
Figure 15 - Party Links Example
Figure 16 - Party Segments Example
Figure 17 - Party Geography History Example
Figure 18 - The Example Bank Data Model
Figure 19 - Volume & Complexity Correlations
Figure 20 - PARTIES view mapping
Figure 21 - Vertically Partitioned Data
Figure 22 - Horizontally Partitioned Data
Figure 23 - Data Commutativity
Figure 24 - _START_DATE Rules
Figure 25 - _END_DATE Rules
Figure 26 - Column Name Abbreviations
Figure 27 - Standard _TYPES table
Figure 28 - Standard _BANDS table
Figure 29 - Standard _PROPERTIES table
Figure 30 - Standard _EVENTS table
Figure 31 - Standard _LINKS table
Figure 32 - Standard _SEGMENTS table
Figure 33 - Standard _HISTORY table
Figure 34 - Typical Organisation Hierarchies
Figure 35 - Stored Organisational Hierarchy
Figure 36 - Relating Individuals to Roles
Figure 37 - TMForum Information Framework (SID) Version 8.0 overview
Figure 38 - Set Processing Technique

Copyright © 2009 Data Management & Warehousing. All rights reserved. Reproduction not permitted without written authorisation. References to other companies and their products use trademarks owned by the respective companies and are for reference purposes only. Some terms and definitions taken from Wikipedia.

Crossword Answer: Expert Gives Us Real Understanding - GURU

