# Knowing the Score

## Statistics can trip you up or trump your opponent's claims. You just need to know how it all adds up.

### By Steven Carter, PsyD, LP

Psychological tests produce a collection of measurements. Simple, direct inspection of that data usually will not allow one to reach valid conclusions. Valid inferences require classification, summary description, and the rules of evidence. Statistics provide those methods.

Using statistics to reach meaningful conclusions about an individual requires inductive inference—reasoning from the specific to the general. For example, we might compare the test scores of John Smith, a 45-year-old male with an alleged acquired brain injury, to a group of English-speaking, 45-year-old males who are known with certainty to have an acquired brain injury. The comparison group is called a sample because the members were selected from a larger group. Psychological evaluators use the deviation between the sample group and larger group to conclude how similar a specific claimant is to the normative sample to which he or she is being compared and to establish how certain one can be about that conclusion.

Determining the extent to which the members of the sample group are similar to the claimant is a key step in any claims dispute that relies upon psychological test data. All reputable and commercially published tests provide a manual with very detailed information about the demographic characteristics of the sample. Those manuals are often available only to licensed professionals; preventing access by the public is intended to prevent claimants from rehearsing their responses to produce a desired outcome. However, the claims professional or attorney is not left bereft of self-help resources.

The scientific literature in PubMed (www.ncbi.nlm.nih.gov) and authoritative reviews, such as the *Buros Mental Measurements Yearbook* (www.unl.edu/buros), provide extensive critiques of the sample quality of all commercially published tests and can quickly inform a claims professional or legal counsel about shortcomings and disputes related to the representativeness of that data when applied to particular individuals. Searching both is free. The lay investigator will find Buros easier because it provides immediate access to multiple comprehensive written critiques of all commonly used tests for $15 per report. PubMed requires sophisticated searching of a research database to identify relevant research studies. You must then obtain and read each study on your own, but the individual studies provide a more detailed analysis for the technically sophisticated reader, and they are the studies on which the conclusions in the Buros critiques are based.

Psychologists commonly make the error of using whatever test is handy and whatever normative sample was supplied by the test publisher to reach inferences about the performance of claimants. For example, in one case, the expert compared a Spanish-speaking farmhand with an eighth grade education in a rural Mexican school and a French-speaking nurse in Quebec to English-speaking Americans with a high school education. These samples are not representative, and the validity of the conclusions can be challenged on that basis.

Most variables measured by psychological tests are normally distributed among the tested population. A normal distribution of data is described as "bell-shaped," as shown in Figure 1. In this illustration, the height of the curve (i.e., the Y-axis) indicates the number of times a score was observed, from few (low height) to many (high height). The horizontal axis (i.e., the X-axis) provides the score value from infinitely low to infinitely high. Nearly all commonly used, commercially published tests provide a normative sample whose data are distributed in this bell-shaped curve. Being able to compare a claimant's test data to the normal distribution for that test is critical to the analysis of statistical data.

In a normal distribution, the mean, mode and median scores are exactly the same. The mean is the sum of all the scores divided by the number of scores. The mode is the most common score in the sample. The median is the score above which half the scores lie and below which half the scores lie. None of these scores are necessarily representative of any single member of the sample. Instead, they are representative of the entire sample as a group.

Mean scores are notoriously vulnerable to skewing by extreme data. For this reason, median scores are typically used to describe samples with a large range. You might be accustomed to hearing news reports on the "median home price" rather than the "mean home price." That is done so that a very small number of multi-million-dollar homes do not give a distorted view. The problem with both means and medians is that they provide a single number with no indication of the range of the variable under consideration. That information is provided by the standard deviation.
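The effect of a few extreme values on the mean is easy to check for yourself. The following is a minimal sketch using Python's standard `statistics` module; the home prices are invented for illustration and do not come from any real market data.

```python
from statistics import mean, median

# Hypothetical home prices in dollars (illustrative only).
# The last entry is a single multi-million-dollar outlier.
prices = [210_000, 225_000, 240_000, 250_000, 265_000, 280_000, 4_500_000]


def summarize(values):
    """Return the (mean, median) of a list of numbers."""
    return mean(values), median(values)


# Summary without the outlier, then with it.
typical_mean, typical_median = summarize(prices[:-1])
skewed_mean, skewed_median = summarize(prices)

print(f"Without outlier: mean={typical_mean:,.0f}  median={typical_median:,.0f}")
print(f"With outlier:    mean={skewed_mean:,.0f}  median={skewed_median:,.0f}")
```

In this made-up sample, adding one expensive home more than triples the mean while the median barely moves, which is why the median is preferred for samples with a large range.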

A standard deviation is, roughly speaking, the typical distance of a score from the mean, expressed in the units used to measure whatever you are measuring. In Figure 2, the table tells us that a person 30 to 34 years old can, on average, repeat 6.61 digits forward (mean). The median is 7.00 digits, and the standard deviation is 1.35 digits. Since one cannot partially repeat a digit and the results cannot be more precise than the original data, we can round these scores and determine that the typical person in this age group can repeat seven digits plus or minus one (i.e., six to eight digits is normal). The cumulative percentages in the columns tell us that 55% (rounded) of people in this sample can repeat seven digits forward, only 9% can repeat nine or more, and nearly 100% can repeat two digits.

The normal curve allows us to use these basic statistics of the mean, median, mode, standard deviation, and percentage to make much more precise statements. Look again at Figure 1. Notice that 34.13% of the scores lie between the mean (the highest point on the curve) and one standard deviation above it, and another 34.13% lie between the mean and one standard deviation below it. Doubling this number and rounding, we can see that 68% of the scores in a normally distributed sample are within one standard deviation of the mean. The same process shows us that 95% of the scores are within two standard deviations of the mean (i.e., (34.13 + 13.59) x 2 = 95.44, then rounded). Using rounded cumulative percentages, 98% of the scores are at or below two standard deviations above the mean. These facts and many others are easily derived from Figure 1. Interesting math, but how does it pertain to insurance claims?
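These percentages can be reproduced without the figure. The sketch below assumes an ideal normal distribution and uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is introduced here purely for illustration.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


within_one = share_within(1)          # roughly 68%
within_two = share_within(2)          # roughly 95%
below_plus_two = std_normal.cdf(2)    # roughly 98% at or below +2 SD

print(f"Within 1 SD of the mean: {within_one:.2%}")
print(f"Within 2 SD of the mean: {within_two:.2%}")
print(f"At or below +2 SD:       {below_plus_two:.2%}")
```

The computed values differ from the figure's sums only in the second decimal place, because the figure's segment percentages are themselves rounded.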

Here's one way: The convention in the social sciences is that we cannot say with confidence that a score is significantly different from the average until it is two standard deviations or more from the mean. This convention is used for the most commonly administered tests of intelligence and memory, known collectively as the "Wechsler Scales" after David Wechsler, the researcher most responsible for their development. Figure 3 provides a range of standard scores on the Wechsler Scales and typical qualitative descriptions. Standard scores have a mean of 100 and a standard deviation of 15. Notice that the term "impairment" is not used until the score is 69 or below. Why is that?

The answer is key to understanding psychological test data, and it is easily found by reference to Figure 1. Look at the bottom of the figure where we find the "Deviation IQ" scores. Note that, if you trace upward from a value of 70 to the normal curve, it corresponds to exactly two standard deviations below the mean (i.e., 100 – (2 x 15) = 70). Notice that only about 2% of the normative sample lies below that point (2.14% + 0.13%) and 98% of the scores are higher. You can perform the same operations on Wechsler Scales subtest scores. All commonly used scaled scores have a mean of 10 and a standard deviation of three, so two standard deviations below the mean is 4 (i.e., 10 – (2 x 3) = 4). The remaining statistical properties are the same as those for standard scores and IQ scores. This knowledge can be used to independently determine if qualitative statements made by a psychologist or other social scientist are supported by the objective data reported.
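The same check can be done on any reported score. The following is a minimal sketch, again assuming an ideal normal distribution; the function names `z_score` and `percent_below` are illustrative, and the means and standard deviations are the ones given in the text (100 and 15 for standard scores, 10 and 3 for scaled scores).

```python
from statistics import NormalDist


def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (+) or below (-) the mean."""
    return (score - mean) / sd


def percent_below(score, mean, sd):
    """Percentage of a normally distributed sample expected to score below `score`."""
    return 100 * NormalDist(mu=mean, sigma=sd).cdf(score)


# Standard score of 70 (mean 100, SD 15) and scaled score of 4 (mean 10, SD 3).
iq_z = z_score(70, mean=100, sd=15)
subtest_z = z_score(4, mean=10, sd=3)
iq_pct = percent_below(70, mean=100, sd=15)

print(f"Standard score 70: z = {iq_z:+.1f}, about {iq_pct:.1f}% score lower")
print(f"Scaled score 4:    z = {subtest_z:+.1f}")
```

Both scores sit two standard deviations below their respective means, so the same roughly 2% of the normative sample would be expected to score lower in each case.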

Let's address one common error before concluding. Look again at Figure 1. The standard deviations, cumulative percentages, Wechsler Scales subtests, and Deviation IQs are all marked at regular and equal intervals. Compare those to the percentile equivalents, which, although regular, are not at equal intervals. What is going on?

The comparison illustrates the difference between percentile "ranks" and percentile "points." The percentile equivalents correspond to percentile ranks. Think about two pairs of claimants. In one pair, claimant one scored at the 40th percentile, and claimant two scored at the 50th percentile. In the other pair, claimant one scored at the 80th percentile, and claimant two scored at the 90th percentile. In both cases, the numerical difference between the scores is 10 percentile points, but in the second case, the ability difference is much larger. That is because the span of scores covered by each percentile rank grows the further the score is from the mean. The practical consequence is that equal differences in percentile rank do not represent equal differences in performance: far from the mean, a gap of a few percentile points reflects a substantial difference in ability, while near the mean, where scores are densely packed, even a single moment's hesitation on a timed test could change the rank by several percentile points.
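The two pairs of claimants can be compared directly by converting each percentile rank back to a standard score. This is a minimal sketch that assumes perfectly normal scores on the Wechsler standard-score scale (mean 100, standard deviation 15); the function name `percentile_to_standard_score` is introduced only for illustration.

```python
from statistics import NormalDist


def percentile_to_standard_score(percentile, mean=100, sd=15):
    """Standard score at a given percentile rank, assuming a normal distribution."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(percentile / 100)


# Same 10-point difference in percentile rank, at two places on the curve.
gap_40_50 = percentile_to_standard_score(50) - percentile_to_standard_score(40)
gap_80_90 = percentile_to_standard_score(90) - percentile_to_standard_score(80)

print(f"40th to 50th percentile: {gap_40_50:.1f} standard-score points")
print(f"80th to 90th percentile: {gap_80_90:.1f} standard-score points")
```

Under these assumptions, the first pair is separated by a little under four standard-score points and the second by between six and seven, even though both pairs differ by the same 10 percentile ranks.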

Some degree of uncertainty must attach to all arguments based on psychological test data. That uncertainty itself can be precisely described, making it possible to make rigorous statements about the uncertainty associated with any particular conclusion. This is one of the key contributions that statistics can make to legal arguments involving psychological test data. Readers desiring more information can consult the *Statistics for Dummies* series of books by Deborah Rumsey or the *Cartoon Guide to Statistics* by Larry Gonick and Woollcott Smith.