WHY DATA ANALYTICS INITIATIVES FAIL:
TIPS FOR BUILDING SUCCESSFUL PROGRAMS
You’ve read the books, taken the classes, and have a strong grasp on the queries, reports, and
analytics you want to run against your data to help detect indicators of potential fraud. So why do
various studies suggest that less than 5 percent of all fraudulent or illegal activities are detected
through automated software within an organization? Through real-life examples this session will
discuss obstacles that might endanger your program’s success, such as dirty data, scope creep,
and excessive false-positive indicators. You will also learn methods used in the past to overcome
these obstacles, and how to help your enterprise implement a cost-effective and successful data
analytics initiative.
You will learn how to:
Identify potential data sources.
Avoid pitfalls of data analytics.
Employ methods to improve the accuracy of your analysis.
Find and transform dirty data.
STEVEN KONECNY, CFE, CIRA, CEH
Director
Ueltzen & Company
Gold River, CA
Steven Konecny is a high-tech investigator and business consultant who specializes in the
utilization of information technology and information analysis within complex corporate
disputes, investigations, litigation, receiverships, and business turnarounds. He has extensive
experience in computer forensics; cybercrime and fraud investigations; e-discovery; complex
data analytics; and providing expert technology consulting services for distressed companies.
Prior to joining Ueltzen & Company, he founded—and worked for over a decade at—a boutique
technology investigations and software development firm. He has also worked within the
Forensics Technology Solutions group of a Big Four accounting firm, where he managed
complex litigation and investigative cases.
“Association of Certified Fraud Examiners,” “Certified Fraud Examiner,” “CFE,” “ACFE,” and the
ACFE Logo are trademarks owned by the Association of Certified Fraud Examiners, Inc. The contents of
this paper may not be transmitted, re-published, modified, reproduced, distributed, copied, or sold without
the prior consent of the author.
26th Annual ACFE Fraud Conference and Exhibition ©2015
You’ve read the books, attended the classes, and have a strong grasp on the queries, reports, and analytics you want to run against your data to help detect indicators of potential fraud. So, now just plug that analytics engine on top of your data, flip the switch, and wait for the magic to happen. Soon you will be in analytics bliss, poring over statistics, fraud schemes, numbers, and suspects. But are your numbers suspect?
Data analytics projects can be extremely complex endeavors, and like any complex project, they are not immune from failure. These projects follow a path very similar to system implementation projects and suffer from many of the same points of failure. Several high-profile government projects intended to replace existing legacy applications recently made headlines for massive system implementation failures, many of which ran into the hundreds of millions of dollars. The causes of those failures were many, but common trends can often be found by looking into past project failures. Understanding where and how failures commonly occur in data analytics projects is essential to building the appropriate risk mitigation countermeasures into the project plan.
Underestimation of Complexity
It is all too common that when a software vendor demonstrates its product, or an analytics book walks through the steps of running its queries, what they overlook and don’t tell you is the assumption that a well-structured and cleansed database environment already exists underneath. Their environment is typically small, controlled, simplistic, and structured. The data usually isn’t from real-world sources, or, if it is, it has been heavily treated, manipulated, and transformed before being used as test data.
In the real world, the source databases for data analytics may be poorly designed, dirty, inaccurate, and voluminous; they may originate from both internal and external systems hosted on legacy mainframes, web applications, or multimedia, multi-source platforms. The data analyst may be dealing not only with databases but with flat files, spreadsheets, electronic reports, EDI, or XML files. There may be multiple operating systems, some of which may no longer be supported by their original creators. Systems may reside in multiple countries with multiple languages and currencies, and some may be running on networks that cannot communicate with each other.
Unless the data warehouse has already been built, the most time-consuming and costliest portion of most data analytics projects is usually the extraction, transformation, and loading (ETL) of the data. ETL takes the data from legacy data sources to an interim data cleansing area (sometimes referred to as an operational data store) and finally to an analysis database or, if the project calls for a more robust architecture, to a data warehouse. Many data analytics projects can expect to spend 60–80 percent of their time on the ETL process, and in some instances even more.
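The staged flow just described, from a legacy source through an interim cleansing area and on to the analysis database, can be sketched as follows. This is a minimal illustration with hypothetical field names and in-memory lists standing in for real data stores, not a production ETL pipeline:

```python
# Minimal sketch of a staged ETL flow: extract raw rows from a legacy
# source, cleanse them in an interim staging step (the "operational
# data store" role), then load the cleansed rows into an analysis store.
# Field names ("vendor_name", "zip") are hypothetical.

def extract(legacy_rows):
    """Pull raw records from the legacy source as-is."""
    return list(legacy_rows)

def transform(raw_rows):
    """Interim cleansing: trim whitespace, normalize case, drop blanks."""
    cleansed = []
    for row in raw_rows:
        name = (row.get("vendor_name") or "").strip().upper()
        if not name:  # incomplete data: skip rows with no vendor name
            continue
        cleansed.append({"vendor_name": name,
                         "zip": (row.get("zip") or "").strip()})
    return cleansed

def load(cleansed_rows, analysis_store):
    """Append cleansed rows to the analysis store (a plain list here)."""
    analysis_store.extend(cleansed_rows)
    return analysis_store

legacy = [{"vendor_name": " Acme Corp ", "zip": "95670"},
          {"vendor_name": "", "zip": "00000"}]  # test row left in system
store = load(transform(extract(legacy)), [])
print(store)  # the blank-name test row is dropped
```

In practice each stage would be a separate database or schema so that raw, cleansed, and analysis-ready data remain auditable on their own.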
Figure 1 – Data Transformation Process
The applications where data resides often suffer from a lack of process controls over how users accessed the application. There may have been no edit checking or standards when entering the data. Different users may enter data differently into the same fields, and processes may or may not have been followed consistently. The underlying data, even within a single data source, will often have quality and consistency problems that, unless transformed, will give inaccurate results when running analytics. This problem is compounded greatly when combining multiple data sources into an aggregated analytics platform.
Other items to consider when cleaning the data include:
Test data – Data created while systems are first being implemented and users are being trained; it may not have been removed prior to full implementation of the system.
Internal use data – Might consist of employee records or special codes used only within an organization but outside the scope of analysis, which could impair analysis or make its results suspect.
Rolled back transactions – Transactions that have been canceled but still reside in the database. In many systems, a canceled transaction may be flagged in an audit table rather than removed from the database.
Incomplete data – Occurs when not all of the
information about a transaction was entered into the
database.
Incorrect data – Wrongly entered data; probably the
hardest to correct when cleansing data, as there
generally is no pattern associated with it.
Inconsistent data – Data that should share a common format but does not, or data that should be classified under a single variable but has more than one variable associated with it.
Duplicate data – Transactions that repeat themselves, either by error or because they were entered multiple times.
An easy-to-understand example of data quality problems is illustrated in the address table shown in Figure 2. The example is a sample from a vendor address table in an enterprise resource planning (ERP) database. In this particular system, the address line has four separate fields, as well as a zip code field. The city and state fields are not shown.
Figure 2 – Sample Address Table
In looking at the data in the table, a number of items quickly become apparent from a data quality and consistency perspective. First, there is no consistency among the addresses: address one might contain the actual street address, or address three might contain it. Looking at each individual line item reveals the following data quality problems:
Lines 3, 12: DBA (“doing business as”) entries
Lines 5, 8, 9, 19: Business names
Line 6: Malformed characters
Line 14: Null values
Lines 5, 11: Four-character zip codes
Line 4: City and state in address field
Lines 24, 26: Duplicates
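Checks for several of the problems listed above (null values, four-character zip codes, and duplicate rows) can be sketched in a few lines. The field names and sample rows below are hypothetical, not taken from the actual table in Figure 2:

```python
# Hypothetical vendor address rows illustrating three of the data
# quality problems above: null values, four-character zip codes,
# and duplicate rows.
rows = [
    {"addr1": "100 MAIN ST", "zip": "95670"},
    {"addr1": "100 MAIN ST", "zip": "95670"},  # duplicate
    {"addr1": None,          "zip": "95814"},  # null value
    {"addr1": "PO BOX 12",   "zip": "9567"},   # four-character zip
]

# Null or empty address lines.
nulls = [r for r in rows if not r["addr1"]]

# Zip codes that are present but not five characters long.
short_zips = [r for r in rows if r["zip"] and len(r["zip"]) != 5]

# Exact duplicates on the (address, zip) key.
seen, dups = set(), []
for r in rows:
    key = (r["addr1"], r["zip"])
    if key in seen:
        dups.append(r)
    seen.add(key)

print(len(nulls), len(short_zips), len(dups))  # 1 1 1
```

Real cleansing would also need fuzzy matching (e.g., “100 MAIN ST” vs. “100 Main Street”), which exact-key checks like this cannot catch.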
While the solution to detecting and cleaning data problems within the address table may appear self-evident in this example, the problem quickly compounds when other address tables within this ERP system are taken into consideration. Other ERP systems may use different formatting for how they structure their addresses. Applied to still other types of application data that contain addresses, the original data structures can be a virtual potpourri, before even considering the quality of the process used to enter the data into each of those systems.
Address tables are easy to understand because a somewhat global standard already exists for how addresses should be formatted, and most business application users around the world are familiar with address formatting conventions. All the other tables that could potentially exist in the data sources will exponentially compound the data cleansing and transformation effort in ways that can only be understood once a systematic approach to understanding the data within each table is undertaken.
Prior to building the analytics database, the leading sources of data quality problems within the source data sets must be identified. By performing a detailed data audit against the source systems during the analysis phase of the project, the time and cost that will be required to transform the source data into the analytics database can be determined. Such an audit runs a battery of queries and tests against the data within each table, including:
Minimum and maximum values
Value ranges
Frequency of values
Variances
Uniqueness
Occurrences
String patterns
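A data audit of this kind can be sketched with simple profiling logic. The example below computes several of the measures listed above (minimum and maximum values, frequency of values, uniqueness, and string patterns) over one illustrative column; the sample values are hypothetical:

```python
from collections import Counter
import re

# Hypothetical profiling of a single column (zip codes here),
# computing audit measures from the list above: min/max values,
# frequency of values, uniqueness, and string patterns.
values = ["95670", "95814", "95670", "9567", None, "95670-1234"]
present = [v for v in values if v is not None]

def pattern(v):
    """Reduce a value to its shape: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))

profile = {
    "min": min(present),                       # lexicographic minimum
    "max": max(present),                       # lexicographic maximum
    "frequency": Counter(present),             # value frequency
    "unique_ratio": len(set(present)) / len(present),
    "patterns": Counter(pattern(v) for v in present),
}
print(profile["patterns"])
```

Here the pattern counts alone flag two anomalies: a four-digit zip (`9999`) and a ZIP+4 value (`99999-9999`) mixed in with plain five-digit codes. Running this style of query over every column of every source table is what turns a vague cleansing estimate into a defensible scope.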
Failure to Engage the Stakeholders
A crack team from XYZ Analytics was retained by Big
Bucks Financials to aid Big Bucks’ investigative unit in
uncovering potential indicators of fraud within their
databases. Over a few days, analysts from XYZ sat down with members of the investigative unit to discuss how the unit typically conducted an investigation. They talked over how reports were generated from the databases and collated to help the investigators determine which targets warranted further review. After a couple of weeks of examining Big Bucks’ investigative team’s practices, XYZ Analytics took their knowledge to work building their analytics reporting engine.
A few months passed, and XYZ Analytics returned to Big Bucks Financials with the results. They had taken the positive and negative data sets provided by Big Bucks, as well as the business rules they had learned through the interview process, and programmed the information into their predictive engine. The engine was then applied against the entire data population, and it returned reports predicting which entities were most likely to have committed fraud. But after reviewing the results, Big Bucks’ investigative team could not understand why the selected entities appeared on the report and could not find sufficient justification in the historical data within their databases. When the analysts from XYZ were asked how the results were derived, it quickly became apparent that there was a large disconnect in understanding between the parties.
Developing the strategy for how to use the data and apply advanced analytics against it to facilitate better decision making doesn’t occur in a one-hour planning session. Failure to engage the stakeholders can lead to systems that no one wants or can use, or to a complete breakdown of expectations about the project’s purpose. It can also lead to a rigid project design built on requirements gathered when knowledge of the underlying source data was limited and new directions for the analytics had not yet been discovered. It can also foster a habit of trying to predict the needs of the end users instead of interacting with them to identify those needs.
At the same time, a failure to establish and stabilize requirements can lead to constant change and continual rework. There needs to be a healthy balance between flexibility and stabilization, which occurs best when the data analysts, management, investigators, and the consumers of the analytics plan jointly. Failure to plan jointly can bring about the collapse of the project.
Most analytics projects are specifically aimed at fulfilling selected requirements of the end users. Thus, representatives of the users are expected to compile their requirements and forward them to the data analysts. The stakeholders expect an end product that fully meets their needs. Unfortunately, it is all too possible for the data analysts to omit important aspects of the users’ requirements during the initial compilation. Such an omission leads to a platform incapable of fulfilling all the users’ needs, a defect that may eventually cause the collapse of the data analytics project, as the users will be very reluctant to use the product.
To mitigate the possibility of this occurring, a couple of approaches work well. While projects will typically bring everyone together at the beginning to define requirements, a data analyst and a business stakeholder should also be paired throughout the entire process. The business stakeholder can recommend directions the analysis should take based on industry knowledge the data analyst may not have, while the data analyst reveals information about the data within the enterprise’s databases that the stakeholder may not have known earlier.
Using the Wrong Tool for the Wrong Problems
During litigation, a few million email messages were
produced and had to be reviewed for responsiveness.
Manually reviewing a few million email messages is
extremely cumbersome and costly and could take a team of
ten attorneys a year to do. A more automated approach was
decided upon. Email messages can be stored just like a
transaction in a database, with each part of the message
such as the “To,” “From,” “Subject,” and “Body” being
fields in the table of the database. So, it was decided that
the data from the email would be applied against a
predictive coding tool.
Predictive coding for e-discovery varies between different
software packages. In most forms that are considered
“technology assisted review” (TAR), a subset of the total
population of documents is randomly selected for a training
process. A group of investigators knowledgeable about the
matter will read each document and identify those that are
relevant to the matter at hand, as well as those that are
irrelevant. Using various algorithms based upon the documents’ coding, the software then analyzes the entire population to identify documents similar to those marked relevant. The TAR software typically assigns each document a weight based upon its proximity to the documents the human reviewers identified as relevant.
Predictive coding for e-discovery can be very useful but
also expensive. It is not something that should be applied
lightly. When the email was processed into a database, a
number of queries were executed to begin to understand the
population at hand. Filters were applied to the email to
remove all messages coming from newsgroups and those
that fell outside of the date range. Another set of filters was
applied to remove non-responsive document sets. By the
time the data analysts finished applying filters to the
database, the relevant document set was manageable
enough that predictive coding was not necessary. Although the predictive coding process had already begun, it was quickly halted.
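The culling steps described above, filtering out newsgroup traffic and out-of-range messages before committing to predictive coding, might look like this in outline. The message fields, sender domains, and date range are hypothetical:

```python
from datetime import date

# Hypothetical culling of an email population before predictive coding:
# drop newsgroup traffic and messages outside the relevant date range,
# then see how large the remaining set actually is.
emails = [
    {"from": "alerts@newsgroup.example", "sent": date(2013, 5, 1)},
    {"from": "cfo@target.example",       "sent": date(2013, 6, 2)},
    {"from": "cfo@target.example",       "sent": date(2016, 1, 9)},  # too late
]

START, END = date(2013, 1, 1), date(2014, 12, 31)

def in_scope(msg):
    if "newsgroup" in msg["from"]:      # newsgroup filter
        return False
    return START <= msg["sent"] <= END  # date-range filter

remaining = [m for m in emails if in_scope(m)]
print(len(remaining))  # 1
```

Only if the surviving set is still unmanageably large does the expense of a TAR training round become worth considering.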
There’s a lot of interesting technology on the market for
analytics, and more becomes available all the time. The
promises of the software vendors, the flashy new products,
and the unrealistic dream of how easily these new products
will better solve business problems drive many decision
makers. Once purchased, there is the ongoing drive to try to fit the product to the data, regardless of whether it is the right tool. Then there is the persistent unwillingness to accept that the approach taken isn’t working and that the plug must be pulled.
This also applies to using “Big Data” technologies for small data sets. What is “Big Data?” It seems that every data analytics or data warehousing project nowadays is called a “Big Data” project. In the 1990s, even before the days of Hadoop, NoSQL, or MongoDB, legacy databases could still process a billion transactions. Today, a properly configured server can hold and search 30 terabytes of electronic documents using little more than off-the-shelf hard drives. Although there may be a temptation to reach for shiny new cutting-edge technologies, many of the old, tried-and-true ones might fit just as well.
Software vendors can be notorious for selling vaporware.
Their products will solve everything.
Using Non-Representative “Sample” Data Sets
An online claims management application was developed
to facilitate the claims processing for a large international
Ponzi scheme. There were close to 200,000 people
involved across the globe, and a system to contact them and
let them enter a claim for losses had to be created. In order
to guide the requirements as well as facilitate the training of
the attorneys and investigators, around 100 sample
transactions were entered into the master tables. The application was finished, and the real data from the Ponzi scheme was loaded so that each claimant could see how much money they had put in and received back.
The system went live and after a few months, analysis of
the claims began. It was quickly found that “victims” in the
Ponzi scheme were also filing false claims. What the real data revealed was that the account information for individuals could be guessed, as some accounts had patterns to their numbering. Almost as if they were communicating, or perhaps thieves simply think alike, individuals around the world would group together a number of accounts, all of which would sum to less than $50,000, and claim them as their own. Because an auditing system had been built into the claims management system, the method used by the false claimants was identified, and their locations were tracked so the claims they filed would not be paid.
When selecting software for a new data analytics project or setting up a prototype project to demonstrate to management the value of implementing a new initiative, sample data sets are often used. This can be quite problematic when laying out the scope, budget, and expectations if the bigger project moves forward. Every organization usually sets aside a specified amount of resources for any project it hopes to accomplish within a particular period of time. The first and most obvious problem with not using real data is that the quality of the data to be analyzed cannot be assessed. This means that, for planning purposes, the cost in time and resources of transforming and loading the data will be unknown. As mentioned earlier in this paper, most of a data analytics project’s budget historically has been eaten up in the extraction, transformation, and loading of the data.
Secondly, during the initial compilation of project requirements, the users were seeing data that was not necessarily representative of what they might encounter on the real project. When the analytics are run, the users may expect an experience similar to the sample data. Inaccurate data will ultimately lead to the generation of wrong queries and reports, which will in turn cause the end users to make wrong decisions. This discovery will lead to mistrust of the process and the tools used to generate them. Once this occurs, users will have no option but to spend a great part of their time validating the reports generated by the analytics. The entire process will eventually slow productivity and culminate in the rejection of the data analytics project. The quickest way to set up unrealistic expectations of the performance and results that can be achieved through analytics is to use fake data when engaging with the stakeholders.
“…success actually requires avoiding many separate possible causes of failure.”
—Jared Diamond, Guns, Germs, and Steel (1999)