WHY DATA ANALYTICS INITIATIVES FAIL:
TIPS FOR BUILDING SUCCESSFUL PROGRAMS
You’ve read the books, taken the classes, and have a strong grasp on the queries, reports, and
analytics you want to run against your data to help detect indicators of potential fraud. So why do
various studies suggest that less than 5 percent of all fraudulent or illegal activities are detected
through automated software within an organization? Through real-life examples this session will
discuss obstacles that might endanger your program’s success, such as dirty data, scope creep,
and excessive false-positive indicators. You will also learn methods used in the past to overcome
these obstacles, and how to help your enterprise implement a cost-effective and successful data
analytics initiative.
You will learn how to:
Identify potential data sources.
Avoid pitfalls of data analytics.
Employ methods to improve the accuracy of your analysis.
Find and transform dirty data.
STEVEN KONECNY, CFE, CIRA, CEH
Director
Ueltzen & Company
Gold River, CA
Steven Konecny is a high-tech investigator and business consultant who specializes in the
utilization of information technology and information analysis within complex corporate
disputes, investigations, litigation, receiverships, and business turnarounds. He has extensive
experience in computer forensics; cybercrime and fraud investigations; e-discovery; complex
data analytics; and providing expert technology consulting services for distressed companies.
Prior to joining Ueltzen & Company, he founded—and worked for over a decade at—a boutique
technology investigations and software development firm. He has also worked within the
Forensics Technology Solutions group of a Big Four accounting firm, where he managed
complex litigation and investigative cases.
“Association of Certified Fraud Examiners,” “Certified Fraud Examiner,” “CFE,” “ACFE,” and the
ACFE Logo are trademarks owned by the Association of Certified Fraud Examiners, Inc. The contents of
this paper may not be transmitted, re-published, modified, reproduced, distributed, copied, or sold without
the prior consent of the author.
26th Annual ACFE Fraud Conference and Exhibition ©2015
You’ve read the books, attended the classes, and have a strong grasp on the queries, reports, and analytics you want to run against your data to help detect indicators of potential fraud. So, now just plug that analytics engine on top of your data, flip the switch, and wait for the magic to happen. Soon you will be in analytics bliss, poring over statistics, fraud schemes, numbers, and suspects. But are your numbers suspect?
Data analytics projects can be extremely complex endeavors, and like any complex project, they are not immune from failure. These projects follow a path very similar to system implementation projects and suffer from many of the same points of failure. Several high-profile government projects intended to replace existing legacy applications recently made headlines for massive system implementation failures, many of which ran into the hundreds of millions of dollars. The causes of those failures were many, but common trends can often be found by looking into past project failures. Understanding where and how failures commonly occur in data analytics projects is essential to building the appropriate risk mitigation countermeasures into the project plan.
Underestimation of Complexity
It is all too common that when a software vendor demonstrates its product, or an analytics book walks through the steps of running its queries, what they overlook and don’t tell you is the assumption that a well-structured and cleansed database environment already exists underneath. Their environment is typically small, controlled, simplistic, and structured. The data usually isn’t from real-world sources, or, if it is, it has been heavily treated, manipulated, and transformed before being used as test data.
In the real world, the source databases for data analytics may be poorly designed, dirty, inaccurate, and voluminous; they may originate from both internal and external systems hosted on legacy mainframes, web applications, or multimedia, multi-source platforms. The data analyst may be dealing not only with databases but with flat files, spreadsheets, electronic reports, EDI, or XML files. There may be multiple operating systems, some of which may no longer be supported by their original creators. Systems may reside in multiple countries with multiple languages and currencies, and some may be running on networks that cannot communicate with each other.
Unless the data warehouse has already been built, the most time-consuming and costliest portion of most data analytics projects is usually the extraction, transformation, and loading (ETL) of the data. ETL takes the data from legacy data sources to an interim data cleansing area (sometimes referred to as an operational data store) and finally to an analysis database or, if the project calls for a more robust architecture, to a data warehouse. Many data analytics projects can expect to spend 60–80 percent of their time on the ETL process, and in some instances even more.
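The staged flow just described, from a legacy source through an interim cleansing area and on to the analysis database, can be sketched as follows. This is a minimal illustration with hypothetical field names and in-memory lists standing in for real data stores, not a production ETL pipeline:

```python
# Minimal sketch of a staged ETL flow: extract raw rows from a legacy
# source, cleanse them in an interim staging step (the "operational
# data store" role), then load the cleansed rows into an analysis store.
# Field names ("vendor_name", "zip") are hypothetical.

def extract(legacy_rows):
    """Pull raw records from the legacy source as-is."""
    return list(legacy_rows)

def transform(raw_rows):
    """Interim cleansing: trim whitespace, normalize case, drop blanks."""
    cleansed = []
    for row in raw_rows:
        name = (row.get("vendor_name") or "").strip().upper()
        if not name:  # incomplete data: skip rows with no vendor name
            continue
        cleansed.append({"vendor_name": name,
                         "zip": (row.get("zip") or "").strip()})
    return cleansed

def load(cleansed_rows, analysis_store):
    """Append cleansed rows to the analysis store (a plain list here)."""
    analysis_store.extend(cleansed_rows)
    return analysis_store

legacy = [{"vendor_name": " Acme Corp ", "zip": "95670"},
          {"vendor_name": "", "zip": "00000"}]  # test row left in system
store = load(transform(extract(legacy)), [])
print(store)  # the blank-name test row is dropped
```

In practice each stage would be a separate database or schema so that raw, cleansed, and analysis-ready data remain auditable on their own.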
Figure 1 – Data Transformation Process
The applications where data resides often suffer from a lack of process controls over how users accessed the application. There may have been no edit checking or standards when entering the data. Different users may enter data differently into the same fields, and processes may or may not have been followed consistently. The underlying data, even within a single data source, will often have quality and consistency problems that, unless transformed, will give inaccurate results when running analytics. This problem is compounded greatly when combining multiple data sources into an aggregated analytics platform.
Other items to consider when cleaning the data include:
Test data – Data created while systems are first being implemented and users are being trained; it may not have been removed prior to full implementation of the system.
Internal use data – Might consist of employee records or special codes used only within an organization but outside the scope of analysis, which could impair analysis or make its results suspect.
Rolled back transactions – Transactions that have been canceled but still reside in the database. In many systems, a canceled transaction may be flagged in an audit table rather than removed from the database.
Incomplete data – Occurs when not all of the
information about a transaction was entered into the
database.
Incorrect data – Wrongly entered data; probably the
hardest to correct when cleansing data, as there
generally is no pattern associated with it.
Inconsistent data – Data that should share a common format but does not, or data that should be classified under a single variable but has more than one variable associated with it.
Duplicate data – Transactions that repeat themselves, either by error or because they were entered multiple times.
An easy-to-understand example of data quality problems is illustrated in the address table shown in Figure 2. The example is a sample from a vendor address table in an enterprise resource planning (ERP) database. In this particular system, the address line has four separate fields, as well as a zip code field. The city and state fields are not shown.
Figure 2 – Sample Address Table
In looking at the data in the table, a number of items quickly become apparent from a data quality and consistency perspective. First, there is no consistency among the addresses: address one might contain the actual street address, or address three might contain it. Looking at each individual line item reveals the following data quality problems:
Lines 3, 12: DBA (“doing business as”) entries
Lines 5, 8, 9, 19: Business names
Line 6: Malformed characters
Line 14: Null values
Lines 5, 11: Four-character zip codes
Line 4: City and state in address field
Lines 24, 26: Duplicates
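Checks for several of the problems listed above (null values, four-character zip codes, and duplicate rows) can be sketched in a few lines. The field names and sample rows below are hypothetical, not taken from the actual table in Figure 2:

```python
# Hypothetical vendor address rows illustrating three of the data
# quality problems above: null values, four-character zip codes,
# and duplicate rows.
rows = [
    {"addr1": "100 MAIN ST", "zip": "95670"},
    {"addr1": "100 MAIN ST", "zip": "95670"},  # duplicate
    {"addr1": None,          "zip": "95814"},  # null value
    {"addr1": "PO BOX 12",   "zip": "9567"},   # four-character zip
]

# Null or empty address lines.
nulls = [r for r in rows if not r["addr1"]]

# Zip codes that are present but not five characters long.
short_zips = [r for r in rows if r["zip"] and len(r["zip"]) != 5]

# Exact duplicates on the (address, zip) key.
seen, dups = set(), []
for r in rows:
    key = (r["addr1"], r["zip"])
    if key in seen:
        dups.append(r)
    seen.add(key)

print(len(nulls), len(short_zips), len(dups))  # 1 1 1
```

Real cleansing would also need fuzzy matching (e.g., “100 MAIN ST” vs. “100 Main Street”), which exact-key checks like this cannot catch.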
While the solution to detecting and cleaning data problems within the address table may appear self-evident in this example, the problem quickly compounds when other address tables within this ERP system are taken into consideration. Other ERP systems may use different formatting for how they structure their addresses. Applied to still other types of application data that contain addresses, the original data structures can be a virtual potpourri, before even considering the quality of the process used to enter the data into each of those systems.
Address tables are easy to understand because a somewhat global standard already exists for how addresses should be formatted, and most business application users around the world are familiar with address formatting conventions. All the other tables that could potentially exist in the data sources will exponentially compound the data cleansing and transformation effort in ways that can only be understood once a systematic approach to understanding the data within each table is undertaken.
Prior to building the analytics database, the leading sources of data quality problems within the source data sets must be identified. By performing a detailed data audit against the source systems during the analysis phase of the project, the time and cost that will be required to transform the source data into the analytics database can be determined. Such an audit runs a battery of queries and tests against the data within each table, including:
Minimum and maximum values
Value ranges
Frequency of values
Variances
Uniqueness
Occurrences
String patterns
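A data audit of this kind can be sketched with simple profiling logic. The example below computes several of the measures listed above (minimum and maximum values, frequency of values, uniqueness, and string patterns) over one illustrative column; the sample values are hypothetical:

```python
from collections import Counter
import re

# Hypothetical profiling of a single column (zip codes here),
# computing audit measures from the list above: min/max values,
# frequency of values, uniqueness, and string patterns.
values = ["95670", "95814", "95670", "9567", None, "95670-1234"]
present = [v for v in values if v is not None]

def pattern(v):
    """Reduce a value to its shape: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))

profile = {
    "min": min(present),                       # lexicographic minimum
    "max": max(present),                       # lexicographic maximum
    "frequency": Counter(present),             # value frequency
    "unique_ratio": len(set(present)) / len(present),
    "patterns": Counter(pattern(v) for v in present),
}
print(profile["patterns"])
```

Here the pattern counts alone flag two anomalies: a four-digit zip (`9999`) and a ZIP+4 value (`99999-9999`) mixed in with plain five-digit codes. Running this style of query over every column of every source table is what turns a vague cleansing estimate into a defensible scope.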
Failure to Engage the Stakeholders
A crack team from XYZ Analytics was retained by Big
Bucks Financials to aid Big Bucks’ investigative unit in
uncovering potential indicators of fraud within their
databases. Over a few days, analysts from XYZ sat down with members of the investigative unit to discuss how the unit typically conducted an investigation. They talked over how reports were generated from the databases and collated to help the investigators determine which targets warranted further review. After a couple of weeks of examining Big Bucks’ investigative team’s practices, XYZ Analytics took their knowledge to work building their analytics reporting engine.
A few months passed, and XYZ Analytics returned to Big Bucks Financials with the results. They had taken the positive and negative data sets provided by Big Bucks, as well as the business rules they had learned through the interview process, and programmed the information into their predictive engine. The engine was then applied against the entire data population, and it returned reports predicting which entities were most likely to have committed fraud. But after reviewing the results, Big Bucks’ investigative team could not understand why the selected entities appeared on the report and could not find sufficient justification in the historical data within their databases. When the analysts from XYZ were asked how the results were derived, it quickly became apparent that there was a large disconnect in understanding between the parties.
Developing the strategy for how to use the data and apply advanced analytics against it to facilitate better decision making doesn’t occur in a one-hour planning session. Failure to engage the stakeholders can lead to systems that no one wants or can use, or to a complete breakdown of expectations about the project’s purpose. It can also lead to a rigid project design built on requirements gathered when knowledge of the underlying source data was limited and new directions for the analytics had not yet been discovered. It can also foster a habit of trying to predict the needs of the end users instead of interacting with them to identify those needs.
At the same time, a failure to establish and stabilize requirements can lead to constant change and continual rework. There needs to be a healthy balance between flexibility and stabilization, which occurs best when the data analysts, management, investigators, and the consumers of the analytics plan jointly. Failure to plan jointly can bring about the collapse of the project.
Most analytics projects are specifically aimed at fulfilling selected requirements of the end users. Thus, representatives of the users are expected to compile their requirements and forward them to the data analysts. The stakeholders expect an end product that fully meets their needs. Unfortunately, it is all too possible for the data analysts to omit important aspects of the users’ requirements during the initial compilation. Such an omission leads to a platform incapable of fulfilling all the users’ needs, a defect that may eventually cause the collapse of the data analytics project, as the users will be very reluctant to use the product.
To mitigate the possibility of this occurring, a couple of approaches work well. While projects will typically bring everyone together at the beginning to define requirements, a data analyst and a business stakeholder should also be paired throughout the entire process. The business stakeholder can recommend directions the analysis should take based on industry knowledge the data analyst may not have, while the data analyst reveals information about the data within the enterprise’s databases that the stakeholder may not have known earlier.
Using the Wrong Tool for the Wrong Problems
During litigation, a few million email messages were
produced and had to be reviewed for responsiveness.
Manually reviewing a few million email messages is
extremely cumbersome and costly and could take a team of
ten attorneys a year to do. A more automated approach was
decided upon. Email messages can be stored just like a
transaction in a database, with each part of the message
such as the “To,” “From,” “Subject,” and “Body” being
fields in the table of the database. So, it was decided that
the data from the email would be applied against a
predictive coding tool.
Predictive coding for e-discovery varies between different
software packages. In most forms that are considered
“technology assisted review” (TAR), a subset of the total
population of documents is randomly selected for a training
process. A group of investigators knowledgeable about the
matter will read each document and identify those that are
relevant to the matter at hand, as well as those that are
irrelevant. Using various algorithms based upon the documents’ coding, the software then analyzes the entire population to identify documents similar to those marked relevant. The TAR software typically assigns each document a weight based upon its proximity to the documents the human reviewers identified as relevant.
Predictive coding for e-discovery can be very useful but
also expensive. It is not something that should be applied
lightly. When the email was processed into a database, a
number of queries were executed to begin to understand the
population at hand. Filters were applied to the email to
remove all messages coming from newsgroups and those
that fell outside of the date range. Another set of filters was
applied to remove non-responsive document sets. By the
time the data analysts finished applying filters to the
database, the relevant document set was manageable
enough that predictive coding was not necessary. Although the predictive coding process had already begun, it was quickly halted.
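The culling steps described above, filtering out newsgroup traffic and out-of-range messages before committing to predictive coding, might look like this in outline. The message fields, sender domains, and date range are hypothetical:

```python
from datetime import date

# Hypothetical culling of an email population before predictive coding:
# drop newsgroup traffic and messages outside the relevant date range,
# then see how large the remaining set actually is.
emails = [
    {"from": "alerts@newsgroup.example", "sent": date(2013, 5, 1)},
    {"from": "cfo@target.example",       "sent": date(2013, 6, 2)},
    {"from": "cfo@target.example",       "sent": date(2016, 1, 9)},  # too late
]

START, END = date(2013, 1, 1), date(2014, 12, 31)

def in_scope(msg):
    if "newsgroup" in msg["from"]:      # newsgroup filter
        return False
    return START <= msg["sent"] <= END  # date-range filter

remaining = [m for m in emails if in_scope(m)]
print(len(remaining))  # 1
```

Only if the surviving set is still unmanageably large does the expense of a TAR training round become worth considering.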
There’s a lot of interesting technology on the market for
analytics, and more becomes available all the time. The
promises of the software vendors, the flashy new products,
and the unrealistic dream of how easily these new products
will better solve business problems drive many decision
makers. Once purchased, there is the ongoing drive to try to fit the product to the data, regardless of whether it is the right tool. Then there is the persistent unwillingness to accept that the approach taken isn’t working and that the plug must be pulled.
This also applies to using “Big Data” technologies for small data sets. What is “Big Data?” It seems that every data analytics or data warehousing project nowadays is called a “Big Data” project. In the 1990s, even before the days of Hadoop, NoSQL, or MongoDB, legacy databases could still process a billion transactions. Today, a properly configured server can hold and search 30 terabytes of electronic documents using little more than off-the-shelf hard drives. Although there may be a temptation to reach for shiny new cutting-edge technologies, many of the old, tried-and-true ones might fit just as well.
Software vendors can be notorious for selling vaporware.
Their products will solve everything.
Using Non-Representative “Sample” Data Sets
An online claims management application was developed
to facilitate the claims processing for a large international
Ponzi scheme. There were close to 200,000 people
involved across the globe, and a system to contact them and
let them enter a claim for losses had to be created. In order
to guide the requirements as well as facilitate the training of
the attorneys and investigators, around 100 sample
transactions were entered into the master tables. The application was finished, and the real data from the Ponzi scheme was loaded so that each claimant could see how much money they had put in and received back.
The system went live and after a few months, analysis of
the claims began. It was quickly found that “victims” in the
Ponzi scheme were also filing false claims. What the real data revealed was that the account information for individuals could be guessed, as some accounts had patterns to their numbering. Almost as if they were communicating, or perhaps thieves simply think alike, individuals around the world would group together a number of accounts, all of which would sum to less than $50,000, and claim them as their own. Because an auditing system had been built into the claims management system, the method used by the false claimants was identified, and their locations were tracked so the claims they filed would not be paid.
When selecting software for a new data analytics project or setting up a prototype project to demonstrate to management the value of implementing a new initiative, sample data sets are often used. This can be quite problematic when laying out the scope, budget, and expectations if the bigger project moves forward. Every organization usually sets aside a specified amount of resources for any project it hopes to accomplish within a particular period of time. The first and most obvious problem with not using real data is that the quality of the data to be analyzed cannot be assessed. This means that, for planning purposes, the cost in time and resources of transforming and loading the data will be unknown. As mentioned earlier in this paper, most of a data analytics project’s budget historically has been eaten up in the extraction, transformation, and loading of the data.
Secondly, during the initial compilation of project requirements, the users were seeing data that was not necessarily representative of what they might encounter on the real project. When the analytics are run, the users may expect an experience similar to the sample data. Inaccurate data will ultimately lead to the generation of wrong queries and reports, which will in turn cause the end users to make wrong decisions. This discovery will lead to mistrust of the process and the tools used to generate them. Once this occurs, users will have no option but to spend a great part of their time validating the reports generated by the analytics. The entire process will eventually slow productivity and culminate in the rejection of the data analytics project. The quickest way to set up unrealistic expectations of the performance and results that can be achieved through analytics is to use fake data when engaging with the stakeholders.
“…success actually requires avoiding many separate possible causes of failure.”
—Jared Diamond, Guns, Germs, and Steel (1999)