Fraud Auditing, Detection, and Prevention Blog

Uncovering Fraud Using Fraud Data Analytics: Part 3 + 4

Mar 26, 2019 4:55:40 PM / by Leonard W. Vona

Fraud Analytics: Planning considerations

In our first blog in a series on fraud data analytics, we identified a ten-step methodology for conducting a fraud data analytics project. In this blog, we will discuss steps four and five:

4. What decisions are needed regarding the availability, reliability and usability of the data?

5. Do you understand the data?

The purpose of assessing the availability and reliability of the data is to determine if the data are usable for fraud data analytics. Availability focuses on the completeness of the data, whereas reliability focuses on the overt accuracy of the data for the planned tests. The word overt is used with intent. Most data are derived from a document. The reliability test cannot determine whether the data were entered correctly; rather, it is a search for readily apparent data errors. To illustrate, suppose the scope period is 2019, but the vendor invoice date in the database is 07/19/1925. Clearly, 1925 is not the correct year on the vendor invoice.

Data Availability

The intent of the entity availability analysis is to determine how much data is available in each entity data field. In small databases, an easy test is to sort on each data column that is included in the test and then count the number of fields that are blank or contain a dash. In large databases, use the count feature to determine the number of blanks or dashes in each column. The important step is to determine the availability of data for planned tests.   
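The blank-or-dash count described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the vendor rows, the column names, and the helpers `is_missing` and `availability` are all hypothetical.

```python
# Hypothetical vendor master rows; the column names are illustrative only.
vendors = [
    {"vendor_id": "V001", "address": "12 Main St", "tax_id": "11-1111111", "phone": "-"},
    {"vendor_id": "V002", "address": "", "tax_id": "22-2222222", "phone": "555-0101"},
    {"vendor_id": "V003", "address": "9 Oak Ave", "tax_id": None, "phone": "  "},
]


def is_missing(value):
    """Treat None, empty or whitespace-only text, and a lone dash as missing."""
    return value is None or str(value).strip() in ("", "-")


def availability(rows, columns):
    """Return {column: (missing_count, populated_percent)} for each column."""
    total = len(rows)
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if is_missing(row.get(col)))
        populated_pct = round(100 * (total - missing) / total, 1) if total else 0.0
        report[col] = (missing, populated_pct)
    return report


report = availability(vendors, ["vendor_id", "address", "tax_id", "phone"])
for col, (missing, pct) in report.items():
    print(f"{col}: {missing} missing, {pct}% populated")
```

A column with a low populated percentage is a signal to reconsider any planned test that depends on that field.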

A similar process should be used to determine the availability of transactional data. The key is to ensure that the critical fields are populated.

The second step of the availability analysis is matching the various tables to determine how many of the transactions have been completed. To illustrate, in the expenditure cycle there should be a purchase order, vendor invoice and a payment transaction. In the availability analysis, you are determining whether there is a purchase order, invoice and payment for each transaction. Often there will be a few transactions missing a purchase order or payment information due to timing and aging. However, prior to data interrogation you should know how many incomplete transactions are within your audit population.
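One way to count incomplete transactions is to classify each invoice by whether a matching purchase order and payment exist. The sketch below assumes, purely for illustration, that invoices carry a purchase order number and that payments reference the invoice number; real table linkages will differ by system, and all names here are hypothetical.

```python
from collections import Counter

# Hypothetical extracts from the three expenditure-cycle tables.
purchase_orders = {"PO-1", "PO-2", "PO-3"}           # purchase order numbers on file
invoices = [
    {"invoice_no": "INV-10", "po_no": "PO-1"},
    {"invoice_no": "INV-11", "po_no": "PO-2"},
    {"invoice_no": "INV-12", "po_no": None},          # no purchase order recorded
    {"invoice_no": "INV-13", "po_no": "PO-3"},
]
payments = {"INV-10", "INV-12"}                       # invoice numbers with a payment record


def completeness(invoices, purchase_orders, payments):
    """Count invoices by which related records are present."""
    counts = Counter()
    for inv in invoices:
        has_po = inv["po_no"] in purchase_orders
        has_payment = inv["invoice_no"] in payments
        if has_po and has_payment:
            status = "complete"
        elif has_po:
            status = "missing payment"
        elif has_payment:
            status = "missing purchase order"
        else:
            status = "missing both"
        counts[status] += 1
    return counts


summary = completeness(invoices, purchase_orders, payments)
for status, count in summary.items():
    print(f"{status}: {count}")
```

The counts give the number of incomplete transactions in the audit population before any data interrogation begins.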

Data Reliability

The reliability testing for the entity data does not lend itself to fraud data analytics. However, the reliability test for transactions is critical for fraud data analytics to avoid false positives. For instance, control numbers originating from source documents are not typically verified. Therefore, in my experience, control numbers have a higher degree of error than an amount field. Data entry errors typically associated with a control number include:

  • adding numeric integers
  • data entry operators entering a portion of a long number
  • documents that do not have a control number
  • substituting a different control number
  • entering the number incorrectly


Although these types of errors usually have no impact on the business process, they can create false positives associated with fraud data analytics testing that uses the control number.
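Some of the control number errors listed above, such as missing numbers or a partially entered long number, may show up as unusual field lengths. The sketch below profiles the length of each control number and lists the values that differ from the most common length. It rests on an assumption that is not always true, namely that the numbers in the sample share a usual length; the sample values are hypothetical.

```python
from collections import Counter

# Hypothetical invoice numbers for a single vendor.
invoice_numbers = ["2019004512", "2019004513", "4514", "", "2019004515", "20190045160"]


def length_profile(values):
    """Count how many control numbers have each length (0 means blank)."""
    return Counter(len(v.strip()) for v in values)


profile = length_profile(invoice_numbers)
usual_length = profile.most_common(1)[0][0]

# Values whose length differs from the usual length deserve a second look
# before any test that relies on the control number.
exceptions = [v for v in invoice_numbers if len(v.strip()) != usual_length]

print("usual length:", usual_length)
print("exceptions:", exceptions)
```

A length exception is not proof of a data entry error; it only marks records that could generate false positives in a control number test.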

Date errors create problems with sequence testing or speed of processing tests. One simple thing to check is the year in the date field to ensure that it is consistent with the scope of the audit.
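The year check can be sketched as follows, using the 2019 scope period and the 07/19/1925 invoice date from the earlier illustration. The MM/DD/YYYY format, the sample records, and the function name `date_exceptions` are assumptions made for this example.

```python
from datetime import datetime

SCOPE_YEAR = 2019  # audit scope period used in the earlier illustration

# Hypothetical invoice dates stored as text in MM/DD/YYYY format.
invoice_dates = {
    "INV-10": "03/14/2019",
    "INV-11": "07/19/1925",
    "INV-12": "2019-13-40",
    "INV-13": "11/02/2019",
}


def date_exceptions(dates, scope_year, fmt="%m/%d/%Y"):
    """Return records whose date cannot be parsed or falls outside the scope year."""
    exceptions = {}
    for key, text in dates.items():
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            exceptions[key] = "unparseable"
            continue
        if parsed.year != scope_year:
            exceptions[key] = f"year {parsed.year} outside scope"
    return exceptions


flagged = date_exceptions(invoice_dates, SCOPE_YEAR)
for key, reason in flagged.items():
    print(key, reason)
```

Some dates just outside the scope year may be legitimate, so the flagged records are candidates for review rather than automatic exclusions.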

Amounts are typically correct, although reversal transactions may give the illusion of duplicate transactions or inflated record counts. In the data cleaning phase, you can search for reversal transactions using exact reversal search techniques to mitigate false positives associated with the amount field.
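A simple form of exact reversal search pairs each negative amount with a positive amount of the same size for the same vendor and invoice number. The sketch below is one possible interpretation, not a standard definition: the matching key, the assumption that the original entry is listed before its reversal, and the sample records are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical payment records; invoice A100 was entered, reversed, and re-entered.
transactions = [
    {"id": 1, "vendor": "V001", "invoice_no": "A100", "amount": Decimal("500.00")},
    {"id": 2, "vendor": "V001", "invoice_no": "A100", "amount": Decimal("-500.00")},
    {"id": 3, "vendor": "V001", "invoice_no": "A100", "amount": Decimal("500.00")},
    {"id": 4, "vendor": "V002", "invoice_no": "B200", "amount": Decimal("75.25")},
]


def find_exact_reversals(rows):
    """Pair each negative amount with one earlier unmatched positive of equal size."""
    unmatched = defaultdict(list)  # (vendor, invoice_no, absolute amount) -> open positive ids
    pairs = []
    for row in rows:
        key = (row["vendor"], row["invoice_no"], abs(row["amount"]))
        if row["amount"] < 0 and unmatched[key]:
            pairs.append((unmatched[key].pop(0), row["id"]))
        elif row["amount"] > 0:
            unmatched[key].append(row["id"])
    return pairs


pairs = find_exact_reversals(transactions)
reversed_ids = {record_id for pair in pairs for record_id in pair}
cleaned = [row for row in transactions if row["id"] not in reversed_ids]

print("reversal pairs:", pairs)
print("remaining ids:", [row["id"] for row in cleaned])
```

Without this step, records 1 and 3 would look like a duplicate payment of invoice A100; after removing the reversal pair, only one payment remains.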

The description field is critical for most transaction analysis. When the description field is populated based on internal systems, the field is usually reliable. Examples of internal systems are sales systems that populate a sales invoice from a product description file or the payroll system that populates earning type from an earnings code table. In these systems, the analysis should search for codes that are not consistent with the known codes in the tables.
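The search for codes that are inconsistent with the known code table can be sketched as a simple lookup. The earnings codes and payroll lines below are hypothetical. The comparison is deliberately exact, so differences in case or stray spaces are reported along with codes that do not exist at all.

```python
from collections import Counter

# Hypothetical earnings code table and payroll detail lines.
earnings_codes = {"REG": "Regular pay", "OT": "Overtime", "BON": "Bonus"}
payroll_lines = [
    {"employee": "E01", "code": "REG"},
    {"employee": "E02", "code": "ot"},     # wrong case
    {"employee": "E03", "code": "XX9"},    # not in the table
    {"employee": "E04", "code": "BON "},   # trailing space
]


def unknown_codes(lines, valid_codes):
    """Return the lines whose code is not an exact match for a known code."""
    return [line for line in lines if line["code"] not in valid_codes]


exceptions = unknown_codes(payroll_lines, earnings_codes)
frequency = Counter(line["code"] for line in exceptions)

for code, count in frequency.items():
    print(repr(code), count)
```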

Description fields that are populated by manual entry, or transcribed from a vendor or customer document, may contain a high degree of error or inconsistency.

Data Usability

The outcome of the availability and reliability analysis is to determine the usability of the data for the planned tests. The fraud data analytics project should try to anticipate the type and frequency of errors that will occur.

The usability analysis is a byproduct of the availability and reliability analysis. The purpose of the usability analysis is twofold. First, it determines whether the data have sufficient integrity to ensure the fraud data analytics will provide a meaningful sample. Second, it leads to a decision on how to go forward with the sample results. The availability and reliability analysis will lead to one of four conclusions:

1. The fraud data analytics plan should be postponed until management improves the internal controls over data entry or enforces adherence to existing internal controls.

2. The transactions containing obvious data errors will be excluded from the specific test to eliminate false positives originating from overt data integrity errors.

3. The degree of error is acceptable and will have minimal impact on the success of the test.

4. The degree of error may create false positives; however, the fraud auditor can resolve the false positives in the audit test phase of the fraud audit.

Assuming the data are deemed usable, the next step is to clean the data consistent with the usability conclusion. The cleaning process includes formatting the data and excluding items with overt data integrity errors. A word of caution: I have seen and heard of many interesting data cleaning techniques, and while many of these techniques are quite clever, any cleaning technique changes the original data, which could affect your sample selection or create false positives. One step toward avoiding errors is to keep the original data next to the cleaned data so that false positives created by the data cleaning can be detected by visual examination.
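Keeping the original value next to the cleaned value can be sketched as follows. The cleaning rule used here (uppercase, then drop everything except letters and digits) is only an assumed example of a cleaning technique, and the invoice numbers are hypothetical. The second half of the sketch lists different original values that collapse to the same cleaned value, since those are where cleaning itself may create a false duplicate.

```python
import re
from collections import defaultdict

# Hypothetical invoice numbers as they appear in the database.
raw_invoice_numbers = ["INV-00123", "inv 123", "INV00123", "A-77", "A77B"]


def clean_control_number(value):
    """Uppercase and drop everything except letters and digits (an assumed rule)."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


# Original and cleaned values side by side for visual examination.
side_by_side = [(raw, clean_control_number(raw)) for raw in raw_invoice_numbers]

# Different originals that become identical after cleaning.
groups = defaultdict(set)
for raw, cleaned in side_by_side:
    groups[cleaned].add(raw)
collisions = {cleaned: sorted(raws) for cleaned, raws in groups.items() if len(raws) > 1}

for raw, cleaned in side_by_side:
    print(f"{raw!r:14} -> {cleaned!r}")
print("collisions:", collisions)
```

Here "INV-00123" and "INV00123" match only after cleaning, so a duplicate invoice test would flag them; with the original column retained, the auditor can judge whether that match is real.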

Understanding the Data

An anomaly is defined as an extreme deviation from the norm. The question "Do you understand the data?" is therefore about understanding what the norm of the data is.

The goal in this stage is to understand both gross numbers and transaction type numbers. To illustrate gross numbers in payroll: How many employees are in the database? How many are active? How many are inactive? To illustrate transaction type numbers: How many employees are paid via direct deposit versus by check? How many are salaried versus hourly?

This stage should create statistical reports summarizing transactions by entity number, control levels, transaction types, or internal codes that are relevant to the business system and planned audit tests. The reports should provide an aggregate dollar level, number of records, maximum dollar, minimum dollar, and average dollar.
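The statistical report described above (aggregate dollars, number of records, maximum, minimum, and average) can be sketched with a small grouping function. The payment records, field names, and the function `summarize` are hypothetical; the same function can be pointed at any entity number or internal code.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical payment records.
payments = [
    {"vendor": "V001", "type": "check", "amount": Decimal("100.00")},
    {"vendor": "V001", "type": "wire", "amount": Decimal("300.00")},
    {"vendor": "V002", "type": "check", "amount": Decimal("50.00")},
    {"vendor": "V001", "type": "check", "amount": Decimal("200.00")},
]


def summarize(rows, group_field):
    """Total, record count, maximum, minimum, and average amount per group."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row[group_field]].append(row["amount"])
    return {
        group: {
            "total": sum(values),
            "count": len(values),
            "max": max(values),
            "min": min(values),
            "average": sum(values) / len(values),
        }
        for group, values in amounts.items()
    }


by_vendor = summarize(payments, "vendor")
by_type = summarize(payments, "type")

for group, stats in by_vendor.items():
    print(group, stats["count"], stats["total"], stats["max"], stats["min"], stats["average"])
```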

You should study these reports to understand the norm of the population and the various subgroups created through internal codes before creating and designing fraud data analytic routines.

Understanding Data from a Fraud Perspective

From a fraud perspective, is data just data? An address field in the vendor master file, customer database or your human resources system is still just an address field. The field is alphanumeric; it contains a number and a description of a physical location. The issue is not the address field; the issue is how to use the address field in the search for fraud risk statements.

At the risk of repeating myself from blog number two, the fraud data analytics planning reports are designed to tell the fraud auditor, at a high level, the probability that the fraud risk statement is occurring in the business systems. The reports are generally not sufficiently detailed to identify a fraud risk statement. The probability is based simply on the fact that transactions exist in the data set that, at a high level, are consistent with the fraud data profile for the risk statement.

The data question is based on searching for patterns and frequencies consistent with the fraud risk statement. In essence, if there are no data consistent with the fraud risk statement data profile, then the fraud risk statement is less likely to be occurring in the audit scope.

The data approach is designed to point the fraud auditor in the right direction based on data rather than residual control risk. It is not my desire to debate which approach is appropriate. I will let the profession debate the question. What is critical is to ensure that fraud auditors focus their resources on the fraud risk statements that have the highest likelihood of occurring within the audit scope.




Topics: Fraud Data Analytics, Fraud Risk Statements, Fraud Auditing, Fraud Detection, Fraud Definitions, Worked Example


Written by Leonard W. Vona

Leonard W. Vona has more than 40 years of diversified fraud auditing and forensic accounting experience. His firm, Fraud Auditing, Inc., advises clients in areas of fraud risk assessment, fraud data analytics, fraud auditing, fraud prevention and litigation support.