DETECTING ANOMALIES IN YOUR DATA USING ROUNDED NUMBERS
Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA

ABSTRACT
Analyzing large amounts of data looking for anomalies can be a disheartening task. You need techniques that will allow you to quickly assess the data in ways that will highlight potential anomalies while keeping you from chasing the wind. The rounded number test is one such technique. Using this test and the SAS® System, you can quickly identify rows in your data whose numeric variables have been rounded such that they defy statistical norms, indicating possible invention of numeric values. Within this paper, the author will present SAS code that will enable you to quickly and easily find anomalies in the data you analyze. The SAS code will include the Data Step, the MERGE statement, and the FREQ, REPORT, and GPLOT procedures. The author will also present some findings from the data he analyzes. The technique presented is powerful, yet easy to understand and use.

INTRODUCTION
Digital analyses are computer assisted audit techniques used to analyze large amounts of data in ways that traditional audit techniques cannot match. The rounded number test is one such analysis. Excessive rounding in numbers within a universe can be a sign of number invention. Frequently, rounded numbers are invented innocently for convenience. For example, when my son ran in a 5K race recently and the media reported that 5,000 people ran, we understood that this was not the official number. But it serves its purpose of being a handy number to use. However, in some cases, rounded numbers are invented for fraudulent or other malicious reasons by persons who do not think to create numbers that would occur more naturally. For example, if someone were to create a false billing, he/she might choose a number like $125.00, rather than $128.00. Round numbers are defined as numbers that are divisible by 10, 25, 100, or 1,000 without leaving a remainder. When analyzing dollar amounts, a common practice is to ignore cents when determining if a number is rounded. For example, $3,200.99 would be treated as $3,200.

Mark Nigrini, Ph.D., states:
“The round number tests are most applicable to cases where estimation may be occurring but is inappropriate for the data being audited. Examples would include inventory counts, overtime hours (here the auditor may want to classify a whole number as a round number), numbers forming the basis of royalty payments (production levels or dollars of sales), refund amounts or credit memo amounts, or check amounts.”

HOW CAN YOU BENEFIT?
If you are in the business of analyzing data, such as the noble profession of auditing, you might need to look for areas of fraud or areas of anomalies. Dr. Mark J. Nigrini and others have successfully used the rounded number test to detect potential fraud. Dr. Nigrini termed the analysis of digits within numbers “digital analysis.” It is difficult for the fraudster to avoid detection from digital analyses because the fraudster typically cannot influence an entire data file. Thus, the fraudster will invariably alter numbers in such a way that defies natural statistical occurrences of rounded numbers. But you don’t have to be looking for fraud to benefit from the rounded number test. There are many non-fraudulent reasons why a universe of numbers can violate natural statistical occurrences of rounded numbers, yet still warrant your investigation.

Many respected digital analysts have written on the subject of the rounded number test. While there may be some debate on what the expected occurrences of rounded numbers are, the table below shows those that appear to be generally accepted, as stated by Dr. Mark J. Nigrini.

Rounded By     Expected Frequency (Percent)
10             10.0
25              4.0
100             1.0
1,000           0.1

It’s not surprising that rounded by 100 occurs much more often than rounded by 1,000 because a number rounded by 1,000 is also rounded by 100. Similarly, numbers rounded by 1,000 and 100 are also rounded by 10.
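One way to see how figures of this size could arise, and why the categories nest, is to count divisibility over a plain run of whole numbers. The Python sketch below is an illustration only and is not part of the author's SAS code. It assumes amounts spread evenly over the integers 1 to 100,000, which is a simplifying assumption and not a claim about how the published expected frequencies were derived. The names DIVISORS, UPPER, and even_share are hypothetical.

```python
# Illustration only: share of the integers 1..UPPER that divide evenly by
# each round-number base. Assumes evenly spread whole amounts (hypothetical).
DIVISORS = (10, 25, 100, 1000)
UPPER = 100_000  # hypothetical upper bound for the illustration


def even_share(divisor, upper=UPPER):
    """Percent of the integers 1..upper that leave no remainder."""
    hits = sum(1 for n in range(1, upper + 1) if n % divisor == 0)
    return 100.0 * hits / upper


shares = {d: even_share(d) for d in DIVISORS}
for d, pct in shares.items():
    print(f"rounded by {d:>5}: {pct:.3f}%")

# Every multiple of 1,000 is also a multiple of 100 and of 10.
nested = all(n % 100 == 0 and n % 10 == 0 for n in range(1000, UPPER + 1, 1000))
print("multiples of 1,000 are also rounded by 100 and 10:", nested)
```

Under that even-spread assumption the shares come out to one in 10, one in 25, one in 100, and one in 1,000, and the nesting check confirms that each larger category is contained in the smaller ones.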

Start Digging
Here is an example from invoice transactions I have analyzed. This data set contains more than 100,000 rows. Notice the table showing the observed versus expected distribution and the delta between them. Then notice the plot showing the same. As you can see below, you’ve got your work cut out for you investigating these anomalies, as there are far too many transactions that are rounded to 10, 25, and 100.

High levels of round numbers may also signal errors and irregularities, and are generally not reviewed in standard audit procedures.
- Richard Lanza, CPA

Round number subset test: measuring the extent of rounding can help auditors detect artificial transactions in inventory counts, overtime hours, refund amounts, or check amounts.
- Dr. Mark Nigrini

You may not get so lucky and find so many anomalies so quickly. Look at the example below from travel claim transactions I have analyzed. This data set contains more than 700,000 rows. Notice the table showing the observed versus expected distribution and the delta between them. Then notice the plot showing the same.

Nice curves. In this case, everything looks according to Hoyle, or whoever came up with the expected frequencies. So, a dead end, right? No, just an opportunity to dig deeper.

Dig Deeper By Subsetting
You can further analyze your data by first subsetting it. For example, if you analyze all of the transactions together and find nothing, you might then look at subsets of the data. In this example, rather than looking at all the travel claim transactions together, you could look at subsets by department, or by week, or by employee. You might then see anomalies showing up. Why? Because if one department or employee is doing something funny, or if something funny was being done during one week, looking at the entire universe can obscure the funny business. But isolating subsets can be revealing. We will begin our quest by subsetting our travel claims data set by the variable DIVISION.

Notice that when we subset by DIVISION and look at division “Z549” we suddenly see some variation from the expected frequency of rounded numbers, especially when rounded by 10. Now take a look at our travel claims data set when we subset by the variable EMPLOYEE.

Here is the analysis for employee “AB1234”. Notice that while the rounded number analysis for all transactions looks normal, the analysis for this one employee does not. There are huge variations in rounded numbers 10, 25, and 100. It might be interesting to look even further and determine if employee “AB1234” works in division “Z549”.

So, the lesson is that while a whole universe may display the expected frequencies, evaluating the universe by subsets can be revealing. Inquiring minds want to know...
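The masking effect described in this section can be shown with a few made-up numbers. The Python sketch below is an illustration only and is not the author's SAS code. The function name rounded_share_by_group, the group labels DIV_A and DIV_B, and every amount are hypothetical, chosen so that the whole universe sits at 10 percent rounded by 10 while one small group does not.

```python
from collections import defaultdict


def rounded_share_by_group(records, divisor):
    """Percent of non-zero whole-dollar amounts divisible by divisor, per group.

    records is an iterable of (group, amount) pairs. Cents are dropped by
    truncation, and amounts that truncate to 0 are skipped.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, amount in records:
        whole = int(amount)  # truncate toward zero, ignoring the cents
        if whole == 0:
            continue
        totals[group] += 1
        if whole % divisor == 0:
            hits[group] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Made-up data: 18 claims in one division, 2 claims in another.
records = [("DIV_A", n + 0.25) for n in range(101, 119)]
records += [("DIV_B", 200.00), ("DIV_B", 57.25)]

overall = rounded_share_by_group([("ALL", amt) for _, amt in records], 10)
by_div = rounded_share_by_group(records, 10)
print("whole universe:", overall)
print("by division:   ", by_div)
```

In this toy data the universe as a whole matches a 10 percent expectation, yet the two-claim group shows half of its amounts rounded by 10. Real subsets need far more rows than this before a deviation means anything.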

What to Do
So, what can you do with these anomalies? You can start to cross check them, looking for something in common. Considering our example, look to see if the anomalies within divisions also occur within employees. Then, you can begin extracting the actual data records that contain the anomalies and use the information on those data records to get to the root of the anomalies.

What have I found? Well, I analyzed travel claim transactions and found that one employee within one division had far too many transactions rounded to 10, 25, and 100. The auditors in my office are currently reviewing the transactions to determine the cause of the anomaly.

SO, HOW DID I USE SAS?
To produce these striking results, I used SAS to do six main tasks, which are as follows:

• Determine the expected frequencies of rounded numbers
• Determine if the numbers in the universe are rounded
• Determine the observed frequencies of rounded numbers in the universe
• Merge the observed and expected frequencies and compute the deltas
• Create a report of the observed versus expected distribution
• Create a plot of the observed versus expected distribution

That does not seem too difficult. This is a SAS paper, so let’s look at some code. The code I am showing is for computing the rounded number test of an entire universe. The modifications needed to do a subset analysis using a BY group analysis are not too different and will not be presented.

Determine the Expected Frequencies
Our fourth step will be to create a data set that contains the observed and expected rounded number frequencies. Before we can do that, we need a SAS data set that contains the expected rounded number frequencies. Here is the code I use to create the expected frequency SAS data set, using the expected frequencies we discussed earlier. The code covers all four rounded numbers: 10, 25, 100, and 1,000.

This Data Step simply creates a new data set containing four rows, one for each of the four rounded numbers we will test, and two columns. The columns contain a label for each of the four rounded numbers (10, 25, 100, and 1,000) and the associated expected frequency.

Determine If the Numbers in the Universe Are Rounded
Before we can determine the observed rounded number frequencies, we need to determine if a number is rounded. We want to determine if the numeric variable in our input SAS data set is rounded by 10, 25, 100, and/or 1,000. Therefore, we must pass each row through four sets of tests. Remember, when the numeric variable is dollars, we want to ignore the cents. Here is the code I use to determine if the numeric variable is rounded. Again, for brevity, I am only showing the code for the rounded numbers 10 and 25. The complete code includes additional statements for rounded numbers 100 and 1,000.

In this Data Step, we create a new SAS data set that contains three new variables: ROUND_BY, ROUND_NO, and ROUNDED. For each row processed, we want to create four output rows, one for each ROUND_BY value: “10”, “25”, “100”, and “1000”. We also want to create a value associated with each of those rows containing a numeric 1 if the input number is rounded or a numeric 0 if the input number is not rounded.

We want ROUND_BY to be a character label, so we declare it with a LENGTH statement and set its length to $4., so that it can hold the longest value, “1000”. We then read an observation from our input SAS data set, in this case IN.TRAVEL. We only need the input numeric variable, so we use the KEEP= option to keep, in this case, the variable AMOUNT. We then create variable ROUND_NO and use the INT function to set it to the integer value of the AMOUNT variable. If ROUND_NO is equal to 0 (meaning the amount lies strictly between -1 and 1) we delete the row.

   /* Expected rounded number frequencies, in percent */
   data work.expected(index=(round_by));
      length round_by $4.;
      format Expected 8.3;
      round_by = "10";
      expected = 10.000;
      output;
      round_by = "25";
      expected = 04.000;
      output;
      round_by = "100";
      expected = 01.000;
      output;
      round_by = "1000";
      expected = 00.100;
      output;
   run;

   /* Flag each amount as rounded (1) or not rounded (0).   */
   /* Only the tests for 10 and 25 are shown.               */
   data work.observed(index=(round_by));
      length round_by $4.;
      set in.travel(keep=amount);
      round_no=int(amount);            /* ignore the cents */
      if round_no = 0 then delete;
      /* Test for rounded by 10 */
      if mod(round_no,10) eq 0 then
         do;
            rounded = 1;
            round_by = "10";
         end;
      else if mod(round_no,10) ne 0 then
         do;
            rounded = 0;
            round_by = "10";
         end;
      output;
      /* Test for rounded by 25 */
      if mod(round_no,25) eq 0 then
         do;
            rounded = 1;
            round_by = "25";
         end;
      else if mod(round_no,25) ne 0 then
         do;
            rounded = 0;
            round_by = "25";
         end;
      output;
   run;

Digital Analysis is an audit technology designed to find abnormal duplications of specific digits, digit combinations, specific numbers, and round numbers in corporate data.
- Dr. Mark Nigrini

We then interrogate the ROUND_NO variable up to 2 times for each of our 4 rounded numbers. The MOD function returns the remainder when the first argument is divided by the second argument. So, in the statement “if mod(round_no,10) eq 0” SAS takes the value of ROUND_NO (which is the integer value of the input numeric variable), divides it by 10, and determines the remainder. If the remainder equals 0, then the next statements in our code tell SAS to set the ROUNDED variable to 1 and the ROUND_BY variable to “10”. Thus, if the integer value of the input numeric variable is divisible by 10 with no remainder, the number is rounded by 10 and we set our flag variable ROUNDED to 1 (true). If the result of the MOD function is not 0, then there is a remainder. Thus, the integer value of the input numeric variable is not evenly divisible by 10, demonstrating the number is not rounded by 10. In this case, we set our flag variable ROUNDED to 0 (false). At this point, we have set the ROUNDED variable to either 1 or 0 and have set the ROUND_BY variable to “10”. We then use the OUTPUT statement to write a row to our output SAS data set that will contain the input numeric variable (in this case AMOUNT), the ROUND_NO variable containing the integer value of the input numeric variable, the ROUND_BY variable, and the ROUNDED variable. We repeat this code logic for the rounded numbers 25, 100, and 1000. An example of the resulting SAS data set follows.

When checking the SAS log for the number of rows written to the output SAS data set, don’t be concerned if there are not 4 times as many rows in the output as there are in the input. Remember, when the integer value of the input numeric variable is 0 we deleted the row. So, it will not be unusual for the output number of rows to be something less than 4 times the number of input rows.
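The flagging logic and the row-count arithmetic described above can be traced on a handful of values. The Python sketch below is an illustration only and is not the author's SAS code. The function name flag_rounded and the sample amounts are hypothetical, and all four bases are tested here for completeness.

```python
DIVISORS = (10, 25, 100, 1000)


def flag_rounded(amounts, divisors=DIVISORS):
    """Return one (amount, round_no, round_by, rounded) row per divisor."""
    rows = []
    for amount in amounts:
        round_no = int(amount)  # truncates toward zero, dropping the cents
        if round_no == 0:
            continue  # amounts strictly between -1 and 1 produce no rows
        for divisor in divisors:
            rounded = 1 if round_no % divisor == 0 else 0
            rows.append((amount, round_no, str(divisor), rounded))
    return rows


sample = [3200.99, 125.00, 128.00, 0.45, 5000.00]  # made-up amounts
rows = flag_rounded(sample)
for row in rows:
    print(row)
print(len(sample), "input rows ->", len(rows), "output rows")
```

Five input amounts yield 16 output rows, not 20, because the 0.45 amount truncates to 0 and is dropped before any test is applied.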

Determine the Observed Frequencies
We now want to determine the observed frequencies of the rounded numbers, which are stored in our ROUNDED variable. A simple FREQ procedure will do nicely.

There is nothing in this code that needs discussing other than to note that I rename the resulting frequency variable from the default “percent” to “Observed” so as to be unique when we merge this SAS data set with the SAS data set containing the expected frequencies. Remember I mentioned that when we want to perform the analysis by subsetting the data there is only a minor change to the code? Well, the main difference is right here. To do the subsetting, you just have to modify the BY statement in the FREQ procedure to read “BY byvar ROUND_BY”, where “byvar” is the variable on which you want to subset the universe. The results of this FREQ procedure will look like the following:

Merge the Frequencies and Compute the Deltas
Now we have a SAS data set containing the expected rounded numbers and their associated frequencies, and we have another SAS data set containing the frequencies of the observed rounded and non-rounded numbers. The next step is to merge the two and compute the difference between the expected and observed. That sounds like a simple Data Step with a MERGE statement. Here is the code:

This Data Step simply creates a new data set by merging the observed and expected data sets by the ROUND_BY variable and creates a new variable, DELTA, containing the difference between the observed and expected frequencies. The observed frequency data set, WORK.ROUNDED, contains 8 rows: 4 for the occurrences of the values that are rounded by 10, 25, 100, and 1000, and 4 for the occurrences of values that are not rounded. We don’t want the values that are not rounded. Remember, these are identified as those where ROUNDED = 0. So, using the WHERE= data set option on WORK.ROUNDED with ROUNDED equal to 1 will include only the rows that are rounded.
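The arithmetic behind the merge can be followed on a small set of flags. The Python sketch below is an illustration only and is not the author's SAS code. The function name compare and the sample flags are hypothetical. The EXPECTED values are the percents stored in the WORK.EXPECTED data set in this paper.

```python
from collections import Counter

# Expected percents as stored in the paper's WORK.EXPECTED data set.
EXPECTED = {"10": 10.0, "25": 4.0, "100": 1.0, "1000": 0.1}


def compare(flags, expected=EXPECTED):
    """Summarize (round_by, rounded) flags against the expected percents.

    Returns {round_by: (count, observed, expected, delta)} where count is the
    number of rounded rows and delta is observed minus expected, in percent.
    """
    totals, hits = Counter(), Counter()
    for round_by, rounded in flags:
        totals[round_by] += 1
        hits[round_by] += rounded
    report = {}
    for round_by, total in totals.items():
        observed = 100.0 * hits[round_by] / total
        exp = expected[round_by]
        report[round_by] = (hits[round_by], observed, exp, observed - exp)
    return report


# Made-up flags: 3 of 10 amounts rounded by 10, 1 of 10 rounded by 25.
flags = [("10", 1)] * 3 + [("10", 0)] * 7 + [("25", 1)] + [("25", 0)] * 9
report = compare(flags)
for round_by, (count, observed, exp, delta) in report.items():
    print(f"{round_by:>4}  count={count}  observed={observed:.3f}  "
          f"expected={exp:.3f}  delta={delta:.3f}")
```

A positive delta means more rounding than expected for that base. Only the rounded rows contribute to the count and the observed percent, which is the same selection the WHERE= option makes in the merge.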

   /* Observed frequencies of rounded and non-rounded values */
   proc freq data=work.observed;
      tables rounded/out=work.rounded
         (rename=(percent=Observed));
      by round_by;
   run;

   /* Merge expected with observed and compute the delta */
   data out.rounded;
      merge work.expected(in=a)
            work.rounded(in=b where=(rounded=1));
      by round_by;
      Delta=sum(observed,-expected);
      label expected="Expected"
            observed="Observed"
            delta="Delta";
   run;

Create a Report of the Observed Versus Expected
Now that the hard part is over, we can create our output. The comparison report is done with a simple REPORT procedure. Here is the fundamental code I use to create the tabular report with the observed versus expected distribution and delta. I accomplished the pretty version shown earlier using the Output Delivery System.

Create a Graph of the Observed Versus Expected
Lastly, we want to create an easy to assimilate graph of our results. Here is the fundamental code I use to create the overlay plot showing the observed, expected, and delta regression lines. This code simply uses the GPLOT procedure to create an overlay plot of the OBSERVED, EXPECTED, and DELTA variables over the ROUND_BY variable. I accomplished the pretty version shown earlier using the Output Delivery System.

If you are not familiar with creating graphs using the Output Delivery System, feel free to review my paper on the subject from WUSS 2002.

My actual code is a bit more complicated, as I use macro variables to allow the user to make specifications before running the application. I also use ODS statements to make HTML files with ActiveX controls in my GPLOT procedure output.

CONCLUSION
Analyzing large amounts of data for anomalies or potential fraud does not have to be a disheartening task. Using digital analyses, such as the rounded number test, you can easily find anomalies in your data. It was not my intention within this paper to provide all of the SAS code I used to create the output shown within this paper. If you would like my code, send me an e-mail request. For a limited time my code is free, in exchange for your rounded number success stories.

REFERENCES
“Digital Analysis - Part 2: A Review of the Audit Tests to Detect Anomalies in Data Subsets”, Mark J. Nigrini, Ph.D., http://www.theiia.org/itaudit/index.cfm?fuseaction=forum&fid=96

“Digital Analysis – Real World Examples”, Richard Lanza, CPA, http://www.theiia.org/itaudit/index.cfm?fuseaction=forum&fid=58

“Digital Analysis Tests and Statistics - Using Digit and Number Patterns to Detect Fraud, Errors, Biases, Irregularities, and Processing Inefficiencies, and to Test New Computer Systems and Year 2000 Conversions”, Mark J. Nigrini, Ph.D.,

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Curtis A. Smith
Defense Contract Audit Agency
P.O. Box 20044
Fountain Valley, CA 92728-0044
Work Phone: 714-896-4277
Fax: 413-383-6395
email: [email protected]

   /* Tabular report of observed versus expected */
   proc report data=out.rounded nowindows missing;
      col round_by count observed expected delta;
      define round_by /order 'Rounded By';
      define count / format=comma12.
         'Observed Frequency Count';
      define expected /format=8.3
         'Expected Frequency Percent';
      define observed /format=8.3
         'Observed Frequency Percent';
      define delta / format=8.3
         'Observed - Expected Frequency Percent';
   run;

   /* Overlay plot of observed, expected, and delta */
   proc gplot data=out.rounded;
      plot (expected observed delta)*round_by / overlay;
   run;
   quit;
