At KCIC, it is not uncommon for FedEx to deliver a couple of large boxes overflowing with paper invoices that a client needs us to review. The goal of the review is to confirm that the total amount supported by the hardcopy evidence in the boxes matches the total amount on a spreadsheet the client provides, so that the spreadsheet can be relied upon at trial. The spreadsheet lists every invoice that has ever been paid, including the amount, the date of payment, and the name of the vendor paid. Furthermore, the work needs to be completed by Friday, and it is already nearly 3:00 pm on Monday. Yes, it is going to be a fair amount of work for our team, and the deadline is tight, but at KCIC we find a way. Then FedEx asks to use the freight elevator. They have another 50 boxes of invoices for us. Suddenly, what seemed like a stretch goal is now an insurmountable task.
We routinely deal with massive amounts of information. On those occasions when analyzing every data point and document would be exceedingly burdensome — especially under the firm deadlines of litigation — we often turn to inferential statistics to help us reliably gain the information we need.
Simply put, inferential statistics allow us to use probability to draw sound conclusions about a population while processing only a sample of the available data. The upside is significant time savings, since sample sizes are typically much smaller than the full population, and the approach remains disciplined, since inferential statistics rest on long-established mathematical methods. The downside is that the client must sacrifice a small amount of certainty in exchange for the savings in time and resources. In our current example, we checked with the client, and they agreed that using inferential statistics was the best option.
Again, this is a disciplined approach, so the steps are clear. The first step is to define the population.
Deductions in inferential statistics are made about a population. The population comprises all the individual items of concern, whether people in a country, fish in the sea, or answers to a questionnaire. In other words, the population is the universe of items from which a sample can be drawn.
When defining a population, it is important to keep the goal in mind. For instance, our example has several candidate populations: all of the invoices in the spreadsheet, all of the paper invoices in the boxes, or all of the line items on all of the invoices, to name a few. Because the goal is to confirm the total amount on the invoice spreadsheet, our population should be built around the dollars in the spreadsheet. To weight each dollar in the sample evenly, we will treat each dollar in the spreadsheet as one item in the population. This makes the total amount on the spreadsheet equal to the population count (rounded to the nearest dollar). Further, each dollar sampled will either have hardcopy documentation or it will not. Based on the findings, we will be able to estimate, with a specific level of statistical confidence, the number of dollars in the population that have supporting evidence versus those that do not.
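As a concrete illustration, here is a minimal sketch (with hypothetical invoice amounts) of how the spreadsheet total defines the population count under this dollar-unit framing:

```python
# A minimal sketch, with hypothetical invoice amounts, of the dollar-unit
# population: each whole dollar on the spreadsheet is one sampling unit, so
# the population count equals the spreadsheet total rounded to the nearest dollar.

invoice_amounts = [1250.00, 89.95, 43712.10, 560.25]  # hypothetical amounts

population_count = round(sum(invoice_amounts))
print(f"Population count: {population_count:,} dollar-units")  # 45,612
```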
Next begins the statistical analysis: constructing the random sample, determining a sample size, setting an acceptable margin of error and confidence level, and drawing conclusions from the review.
Since we are trying to infer the number of dollars in the spreadsheet that have hardcopy documentation, we use each dollar as an individual item in the population. To ensure that each dollar has the same likelihood of being chosen for the sample, we need to randomize the selection process. Random selection makes the sample more likely to be representative of the total population, which is a necessary precursor to drawing conclusions about the population from information about the sample. In our example, a dollar from a larger invoice is more likely to be selected than a dollar from a smaller invoice, because the larger invoice contributes more dollars to the population. It is easy to imagine how a non-random sample could skew the results significantly. For instance, if we looked only at the dollars within the smallest invoices, the findings might be representative of the smaller invoices, but they would miss important differences that may hold for the larger ones.
One way to draw the random sample is to use a random number generator to produce as many random numbers between 0 and 1 as there are items in the sample (e.g., if the sample size were 1,000, we would generate 1,000 random numbers between 0 and 1). Next, we would multiply each random number by the total number of items in the population; each result, rounded to a whole number, gives the ordinal position of an item to include in the sample. For example, if our population comprises $1 million (that is, 1,000,000 one-dollar items) and the first random number generated is 0.3527912, then the 352,791st dollar in our population would be included and reviewed in the sample.
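To make the mechanics concrete, here is a minimal sketch of that selection procedure in Python. The invoice amounts, the function name draw_dollar_sample, and the sample size are all hypothetical, and the rounding convention used here (truncate and add one, so dollar positions stay 1-based) is one reasonable implementation choice rather than a requirement of the method:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def draw_dollar_sample(invoice_amounts, sample_size, seed=None):
    """Select invoices by sampling individual dollar positions at random.

    Each whole dollar in the population has an equal chance of selection,
    so larger invoices are proportionally more likely to be drawn.
    """
    rng = random.Random(seed)
    # Cumulative dollar boundaries: invoice i covers the dollar positions
    # from boundaries[i-1] + 1 through boundaries[i].
    boundaries = list(accumulate(round(a) for a in invoice_amounts))
    population_count = boundaries[-1]

    selections = []
    for _ in range(sample_size):
        u = rng.random()                          # uniform in [0, 1)
        position = int(u * population_count) + 1  # ordinal dollar, 1-based
        invoice_index = bisect_left(boundaries, position)
        selections.append(invoice_index)
    return selections

# Hypothetical usage: four invoices, three sampled dollar positions.
amounts = [1250.00, 89.95, 43712.10, 560.25]
print(draw_dollar_sample(amounts, sample_size=3, seed=42))
```

Because each sampled dollar position is mapped back to the invoice whose cumulative range contains it, larger invoices are selected proportionally more often, which is exactly the dollar-weighting described above.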
Once the random sample is established, the invoices can be reviewed to confirm the amounts in the spreadsheet. Hopefully by the time we reach this point it’s only Wednesday — we have plenty of time to complete our analysis. In future posts, I will cover the remaining steps: sample sizes, margins of error, levels of confidence, confidence intervals, and conclusions.