4 Statistics and Sampling
Statistics involves the study of research designs, the collection of data, describing the data, analyzing the data, and then forming a conclusion. We are interested mainly in the analysis of data that has already been collected for us by various business systems. We hope to be able to arrive at various conclusions after analyzing the data. Understanding some basic statistics allows you to understand the makeup or distribution of your data files. This is especially useful when your data file is large and contains millions of records. There are various types of statistical analysis, but the two major categories are descriptive statistics and inferential statistics.

DESCRIPTIVE STATISTICS

Descriptive statistics describe and summarize information from the data set. Where the data have categories, each group can be summarized as a frequency or as a percentage, known as a relative frequency. With numerical data, we determine the middle of the data and the spread, that is, how close or far the numbers are from that middle. We can determine ranges and possibly determine relationships between two variables. Data can also be summarized into ranges.

There are two main types of data: categorical (qualitative) data and numerical (quantitative) data. Categorical data in a record describes qualities or characteristics of the record. For example, in a payroll record, the division or area field that the employee works in is categorical even if the division is coded as a number. Whether the employee is salaried or paid hourly is another categorical field. Pivot tables are a great way to view two categorical variables at once in summarized form. Numerical data includes items such as counts, amounts, or quantities in the record fields. Only actual numerical or quantitative data represent real numbers on which meaningful calculations can be performed. It would not make sense to perform mathematical operations on items represented by numbers in categorical fields. Ordinal data is a third, hybrid type. The data is in categories, but the categories have a meaningful order. Ordinal data can be analyzed as categorical data or, if the categories are represented by meaningful numeric values, basic calculations may also be performed. An example would be rankings or ratings such as 1 for poor, 2 for average, and 3 for superior.

INFERENTIAL STATISTICS

Inferential statistics is the use of sample statistics to make inferences about population parameters. A parameter is a characteristic of the population. Suppose you sampled 100 records out of the entire data set or population. Assuming that the 100 sampled records are representative of the total population, you can take the results of those 100 records and infer the results over the entire population or data set. Say the 100 records were from the sales data set of a retail store, and you wish to pull and examine the sales invoices to determine whether each customer was female or male. This information would help the store better target its advertising. If the results were 35 sales to males and 65 sales to females, you would conclude that females made 65 percent of all purchases, as you cannot review every sales invoice. In this case, 65 percent is the sample statistic, and the corresponding population parameter is the true proportion of purchases made by females.

With data analytic software, you can perform calculations on the entire data set or population. Therefore, we will focus on descriptive statistics, whether the data fields are categorical or numerical.
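To make the distinction between categorical and numerical summaries concrete, here is a minimal Python sketch using pandas. The small data set and the field names (DIVISION, PAY_TYPE, GENDER, PAID_AMOUNT) are hypothetical examples invented for illustration, not fields from any file discussed in this chapter; in practice, the same summaries are produced directly in IDEA.

```python
import pandas as pd

# Small illustrative data set; field names and values are made up.
sales = pd.DataFrame({
    "DIVISION":    ["East", "East", "West", "West", "East", "North"],
    "PAY_TYPE":    ["Salaried", "Hourly", "Hourly", "Salaried", "Hourly", "Hourly"],
    "GENDER":      ["F", "M", "F", "F", "M", "F"],
    "PAID_AMOUNT": [120.50, 89.99, 412.00, 35.25, 250.00, 99.95],
})

# Categorical field: frequency and relative frequency of each category.
print(sales["GENDER"].value_counts())
print(sales["GENDER"].value_counts(normalize=True) * 100)

# Numerical field: count, mean, spread, minimum, and maximum.
print(sales["PAID_AMOUNT"].describe())

# Pivot table: two categorical variables summarized at once.
print(pd.pivot_table(sales, index="DIVISION", columns="PAY_TYPE",
                     values="PAID_AMOUNT", aggfunc="count", fill_value=0))
```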
Inferential statistics are discussed further in the sampling section of this chapter.

MEASURES OF CENTER

In a data set, you want to know where the middle of the data is and what the typical or frequent value is. The most common way to summarize numerical data is to describe its center using the mean (or average) and the median.

The mean is simply the average of all the numbers. In terms of the data set, it is the total of all the numbers in a particular field divided by the number of records. The mean may not even appear as an actual transaction amount in the data set, as it is a calculated value. For a PAID_AMOUNT field of 1,000 records, where the total of those records is $250,000, the mean (or average) is $250. IDEA's field statistics show averages for each numeric field. When you summarize a file in IDEA, you may also select to output the average for each key or group in the newly summarized file.

You need to take care when considering the mean. It is very sensitive to extremes or outliers. A few very large or very small amounts may make the mean unrepresentative of the data. If there were a single transaction of $200,000 in the previous example and you excluded this outlier, the mean would be about $50 rather than $250.

The median is not sensitive to outliers. The median is the midpoint in the distribution of the data: the point that divides the distribution in two, with one half being equal to or less than the median and the other half being equal to or greater than the median. Again, there may not be an actual record amount that corresponds to the median; it is merely a positional value. The data must be indexed or sorted, in either descending or ascending order, before you can apply the formula. Once the data is ordered, determine whether there is an odd or even number of records. If the data contains an odd number of records, the median is the value exactly in the middle of the ordered records. For the numbers 1, 2, 4, 5, and 5, the middle or median position is the third number, which has a value of 4. If the data contains an even number of records, the median is the average of the two numbers appearing in the middle. For the numbers 1, 2, 4, 7, 8, and 8, the two middle numbers are 4 and 7. The median position falls between them, at the 3.5 spot. Adding those two middle numbers and dividing by 2 gives a median value of 5.5. To calculate the position of the median, you may use this formula:

Median position = (N + 1)/2

The uppercase letter N represents the number of records in the field or population. A lowercase n represents the number of cases in a sample and is used when the median position is needed for a sample. Applying the formula to the odd-numbered example of five records, the median position is (5 + 1)/2 = 3, and the value in that position is 4. For the even-numbered example of six records, the median position is (6 + 1)/2 = 3.5, and the value at that position is 5.5.

When the mean and median values are far from each other, it is good to be aware of both values: you then know that there are outliers in the data that need to be addressed. Along with the mean and median, the mode is frequently mentioned as a measure of center. The mode is the most frequently occurring value in a distribution or data set. It is determined by counting the frequency of each result.
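The worked examples above can be reproduced with Python's standard statistics module; this short sketch uses the same PAID_AMOUNT figures and the same two small number lists from the text.

```python
import statistics

# Mean: $250,000 across 1,000 records gives an average of $250.
mean_paid = 250_000 / 1_000            # 250.0

# Odd number of records: the median is the middle value.
odd = [1, 2, 4, 5, 5]
median_odd = statistics.median(odd)    # position (5 + 1) / 2 = 3 -> value 4

# Even number of records: the median is the average of the two middle values.
even = [1, 2, 4, 7, 8, 8]
median_even = statistics.median(even)  # (4 + 7) / 2 = 5.5

# Mode: the most frequently occurring value.
mode_odd = statistics.mode(odd)        # 5 appears twice -> 5

print(mean_paid, median_odd, median_even, mode_odd)
```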
In the discussion of Benford's Law in Chapter 5, it can be seen that the leading digit 1 in the first-digit test is the mode, that is, the digit that most frequently appears in data sets. While we will not be using the mode for any calculations, you should be aware of what it represents.

MEASURE OF DISPERSION

In order to understand the data, you need to be aware of its dispersion or variability. To do so, you need more information than just the mean. Different distributions may have the same mean, so the mean alone is not that informative. The range measures variability and is simply the span from the lowest item to the highest item. IDEA's field statistics display a minimum value and a maximum value, which give the range of the data in that particular field. The range gives some indication of the distribution of the data; certainly, it advises you of the extremes. For example, data in a particular accounts payable file may have a range from $9.00 to $1,004,462.00. While this may not be meaningful alone, additional information will give you a better sense of the data. Other useful ranges include the starting and ending transaction numbers, check numbers, or transaction dates.

MEASURE OF VARIABILITY

Measuring variability is the determination of how far the data is spread out. It is the extent to which data points in a data set diverge from the mean or average value. Knowledge of the variability of the data gives the auditor or investigator a sense of whether transactions fall outside the normal area or pattern.

Deviations from the Mean

Deviation from the mean is how far a number is from the mean of the distribution. It is calculated by subtracting the mean from each individual number. Some numbers will be below the mean, resulting in negative differences, and some will be above the mean, resulting in positive differences. The total or sum of these differences will always net to zero.

The Mean Deviation

The mean deviation, also known as the average deviation, describes (on average) how far each number is from the mean. It is calculated by subtracting the mean from each individual number, but the differences are taken as absolute values; that is, the negative signs are ignored. These absolute differences are summed and then divided by the number of records in the distribution.

The Variance

The variance is a measure of dispersion that eliminates both the issue of the differences totaling zero and the issue of negative numbers. You calculate it by squaring each of the differences, taking the total of the squared differences, and then dividing that total by the number of records. IDEA's field statistics provide the population variance of the data set.

The Standard Deviation

The most common measure of variability is the standard deviation. It is the distance of a number from the center or average. The standard deviation is the square root of the variance: you calculate it by squaring each of the differences, taking the total of the squared differences, dividing that total by the number of records, and then taking the square root of the result. The standard deviation tells us the variability in the distribution of the data, that is, how far the numbers lie from the mean. The further a number deviates from the mean, the larger the standard deviation. It can be used as a measure of relativity between numbers and as a basis for comparison across different data sets. Since the standard deviation is relative, it eliminates the issue of comparing different scales or bases, as a ratio is calculated.
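As a check on these definitions, here is a short Python sketch that computes the deviations from the mean, the mean deviation, the population variance, and the population standard deviation exactly as described above. The small list of amounts is made up purely for illustration.

```python
import math

# Illustrative amounts only; any numeric field would do.
amounts = [9.0, 120.0, 480.0, 1500.0, 391.0]
n = len(amounts)
mean = sum(amounts) / n

# Deviations from the mean always net to zero (allowing for rounding).
deviations = [x - mean for x in amounts]

# Mean (average) deviation: the average of the absolute deviations.
mean_deviation = sum(abs(d) for d in deviations) / n

# Population variance: the average of the squared deviations.
variance = sum(d ** 2 for d in deviations) / n

# Population standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(round(sum(deviations), 10), mean_deviation, variance, std_dev)
```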
If you were comparing test scores, whether they were marked out of 100 or 125, the standard deviations could be compared without any additional calculations.

Standard Deviation of a Population versus Standard Deviation of a Sample

The standard deviation calculation discussed earlier is the population standard deviation: the total of the squared differences is divided by the number of records, or n. When calculating the standard deviation of a sample, the total of the squared differences is divided by the number of records minus 1, or n – 1. Subtracting 1 is a correction factor. The smaller denominator results in a larger standard deviation for the sample, which should be more representative of the true standard deviation of the population. The accuracy of a sample standard deviation increases as the sample size increases: the larger the sample, the closer its standard deviation comes to the standard deviation of the population, and the less impact the correction factor has. Dividing by a denominator of, say, 4 (5 – 1, where the sample size is 5) has a far greater impact than dividing by 499 (500 – 1, where the sample size is 500). IDEA's field statistics provide both the population and sample deviation information for the data set.

Z-scores, or standard scores, tell us how many standard deviations a number lies from the mean, and they are a good example of how the standard deviation is applied in a formula. Z-scores are discussed further along in the book.

SAMPLING

Merriam-Webster defines sampling as "the act, process, or technique of selecting a representative part of a population for the purpose of determining parameters or characteristics of the whole population."1 Simply put, sampling is the process of selecting a subset of the population, that is, a number of records from the data set, for the purpose of making inferences or drawing conclusions about the entire population or data set. Audit sampling is the audit procedure of examining a portion of the items within a class of transactions in order to evaluate one or more characteristics of that entire class. Either statistical or nonstatistical sampling methods may be applied to part of the data set to reach a conclusion about the entire data set. Sampling is effective when the audit procedure or step does not require a 100 percent review of the population, but a decision or conclusion is required and it is not cost effective to audit 100 percent of the transactions.

Statistical Sampling

Statistical sampling uses mathematical calculations both for selecting and for evaluating a sample from the data set. It states in numeric terms the parameters and precision levels associated with the sample conclusion. One use of statistical sampling that we are most familiar with is polling to determine candidates' standings in upcoming elections or in popularity surveys. In a November 2013 poll, Toronto's mayor Rob Ford maintained his 42 percent approval rating after he "admitted he has smoked crack, bought illegal drugs, and might have driven drunk,"2 among other issues. The poll of 1,049 Toronto residents determined the 42 percent rating with an accuracy of plus or minus 3 percent, 19 times out of 20. The 19 times out of 20 translates to a confidence level of 95 percent, so the results from the 1,049-person sample can be applied to the general population of Toronto with 95 percent confidence that the approval rating is between 39 percent and 45 percent.
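The poll's plus or minus 3 percent can be reproduced with the standard normal-approximation formula for the margin of error of a proportion, z × sqrt(p(1 − p)/n), where 1.96 is the z-value for 95 percent confidence. The formula and z-value are standard statistical facts rather than material from this chapter; the figures below are taken from the poll example above.

```python
import math

p = 0.42   # sample proportion: 42 percent approval
n = 1049   # sample size: 1,049 Toronto residents
z = 1.96   # z-value for a 95 percent confidence level (19 times out of 20)

# Normal-approximation margin of error for a proportion.
margin = z * math.sqrt(p * (1 - p) / n)

print(f"margin of error: +/- {margin:.3f}")                 # about +/- 0.030, i.e., 3 percent
print(f"interval: {p - margin:.3f} to {p + margin:.3f}")    # about 0.39 to 0.45
```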
An auditor may be verifying an account balance through statistical sampling and conclude that it is $100,000 plus or minus 5 percent, or $5,000 each way ($95,000 to $105,000), 19 times out of 20. The conclusion would be that, given the precision of the sample at 5 percent (plus or minus $5,000), there is assurance that the balance is correct with a confidence level of 95 percent. In addition, if materiality had been predetermined at 7 percent, it could be concluded that there is no material error based on the precision level. The confidence level is what remains after the acceptable sampling risk is removed: by selecting a 95 percent confidence level, you allow only a 5 percent chance of drawing a sample that does not adequately represent the entire population.

Using statistical samples to become familiar with the data set may be useful, but if the sample is not used to reach a conclusion, it cannot be considered part of the audit procedure. In addition to formulating a conclusion, statistical sampling must use statistical calculations, and the sample must be random. Proper use of statistical sampling is beneficial because it:

Nonstatistical Sampling

Nonstatistical sampling does not involve the use of statistical calculations. It relies on the auditor's subjective sample selections and has a less standardized approach. Nonstatistical sampling is beneficial where:

In order to perform nonstatistical sampling effectively, the auditor must have a good knowledge of the data set. Knowing the contents of the data or population allows for a supportable sample selection and also supports the conclusions drawn from the results. Sample selection may be based on random sampling or on other nonmathematical techniques such as judgmental, haphazard, or block selection.

Judgmental selection is frequently used when the auditor is very experienced and selects samples based on sound judgment. Typically, the auditor makes the selections based on a combination of the representativeness of the population, the value of the items, and the relative risk of the items.

Haphazard selection is where the auditor picks items without the basis of any mathematical formula. The auditor believes that the items selected are representative of the population and that no intentional bias was applied to any of the included or excluded items.

Block selection is where a contiguous sequence of items is selected as the sample. A block might be invoice numbers 1000 to 1100 or a specific type of transaction for the month of March. The effectiveness of block selection can be much improved by sampling several blocks.

Sampling Risk

A sample selected through either statistical or nonstatistical methods might not truly reflect the population, even if selected with the utmost care. This is the cause of sampling risk: the auditor's conclusion based on the selected sample may differ from the reality of the entire population of the data set. Sampling risk arises because limited time and resources prevent an audit of the entire population.

Alpha or Type I risk is the risk of incorrect rejection. That is, the auditor incorrectly concludes from the sample that the population errors are worse than they actually are.
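The random and block selection approaches described above can be sketched in a few lines of Python. The population of invoice numbers, the sample size, and the additional block boundaries are illustrative assumptions (the 1000 to 1100 block echoes the example in the text); in practice, the selection would be drawn from the actual audit population, for example through IDEA's sampling functions.

```python
import random

# Hypothetical population of invoice numbers.
population = list(range(1, 10_001))

# Random selection: every item has an equal chance of being chosen.
random.seed(42)                      # fixed seed so the selection can be reproduced
random_sample = random.sample(population, k=100)

# Block selection: a contiguous sequence of items, e.g., invoices 1000 to 1100.
block_sample = [inv for inv in population if 1000 <= inv <= 1100]

# Sampling several blocks improves the effectiveness of block selection.
blocks = [(1000, 1100), (4500, 4600), (8200, 8300)]
multi_block_sample = [inv for inv in population
                      for lo, hi in blocks if lo <= inv <= hi]

print(len(random_sample), len(block_sample), len(multi_block_sample))
```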