# Nonparametric Tests

My previous two posts on hypothesis testing presented a number of tests of hypothesis for continuous, dichotomous, and discrete outcomes. Tests for continuous outcomes focused on comparing means, while tests for dichotomous and discrete outcomes focused on comparing proportions. All of the tests presented in the modules on hypothesis testing are called parametric tests and are based on certain assumptions. For example, when running tests of hypothesis for means of continuous outcomes, all parametric tests assume that the outcome is approximately normally distributed in the population. This does not mean that the data in the observed sample follow a normal distribution, but rather that the outcome follows a normal distribution in the full population, which is not observed. For many outcomes, investigators are comfortable with the normality assumption (i.e., most of the observations are in the center of the distribution while fewer are at either extreme). It also turns out that many statistical tests are robust, which means that they maintain their statistical properties even when assumptions are not entirely met. Based on the Central Limit Theorem, tests are robust to violations of the normality assumption when the sample size is large. When the sample size is small and the distribution of the outcome is not known and cannot be assumed to be approximately normally distributed, then alternative tests called nonparametric tests are appropriate.

### Parametric vs. Nonparametric Tests

Parametric implies that a distribution is assumed for the population. Often, an assumption is made when performing a hypothesis test that the data are a sample from a certain distribution, commonly the normal distribution. Nonparametric implies that there is no assumption of a specific distribution for the population. An advantage of a parametric test is that, if the assumptions hold, the power (the probability of rejecting H0 when it is false) is higher than the power of a corresponding nonparametric test with equal sample sizes. An advantage of nonparametric tests is that the test results are more robust against violation of the assumptions. Therefore, if assumptions are violated for a test based upon a parametric model, the conclusions based on parametric test p-values may be more misleading than conclusions based upon nonparametric test p-values.

Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions (e.g., they do not assume that the outcome is approximately normally distributed). Parametric tests involve specific probability distributions (e.g., the normal distribution) and the tests involve estimation of the key parameters of that distribution (e.g., the mean or difference in means) from the sample data. The cost of fewer assumptions is that nonparametric tests are generally less powerful than their parametric counterparts (i.e., when the alternative is true, they may be less likely to reject H0).

It can sometimes be difficult to assess whether a continuous outcome follows a normal distribution and, thus, whether a parametric or nonparametric test is appropriate. There are several statistical tests that can be used to assess whether data are likely from a normal distribution. The most popular are the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test. Each test is essentially a goodness-of-fit test and compares observed data to quantiles of the normal (or other specified) distribution. The null hypothesis for each test is H0: Data follow a normal distribution versus H1: Data do not follow a normal distribution. If the test is statistically significant (e.g., p<0.05), then there is evidence that the data do not follow a normal distribution, and a nonparametric test is warranted. It should be noted that these tests for normality can be subject to low power. Specifically, the tests may fail to reject H0: Data follow a normal distribution when in fact the data do not follow a normal distribution. Low power is a major issue when the sample size is small, which unfortunately is often when we wish to employ these tests. The most practical approach to assessing normality involves investigating the distributional form of the outcome in the sample using a histogram and augmenting that with data from other studies, if available, that may indicate the likely distribution of the outcome in the population. There are some situations when it is clear that the outcome does not follow a normal distribution. These include situations:

• when the outcome is an ordinal variable or a rank,
• when there are definite outliers or
• when the outcome has clear limits of detection.

### Nonparametric Techniques

Nonparametric techniques of hypothesis testing are applicable for many quality engineering problems and projects. The nonparametric tests are often called “distribution-free” since they make no assumption regarding the population distribution. Nonparametric tests may be applied to ranking tests in which the data are not specific in any continuous sense, but are simply ranks. Parametric tests are generally more powerful and can test a wider range of alternative hypotheses. It is worth repeating that if data are approximately normally distributed then parametric tests (as in the modules on hypothesis testing) are more appropriate. However, there are situations in which assumptions for a parametric test are violated and a nonparametric test is more appropriate.

In nonparametric tests, the hypotheses are not about population parameters (e.g., μ=50 or μ1=μ2). Instead, the null hypothesis is more general. For example, when comparing two independent groups in terms of a continuous outcome, the null hypothesis in a parametric test is H0: μ1=μ2. In a nonparametric test, the null hypothesis is that the two populations are equal; often this is interpreted as the two populations being equal in terms of their central tendency.

Nonparametric tests have some distinct advantages. With outcomes such as those described above, nonparametric tests may be the only way to analyze these data. Outcomes that are ordinal, ranked, subject to outliers or measured imprecisely are difficult to analyze with parametric methods without making major assumptions about their distributions as well as decisions about coding some values (e.g., “not detected”). As described here, nonparametric tests can also be relatively simple to conduct.

Continuous data are quantitative measures based on a specific measurement scale (e.g., weight in pounds, height in inches). Some investigators make the distinction between continuous, interval and ordinal scaled data. Interval data are like continuous data in that they are measured on a constant scale (i.e., there exists the same difference between adjacent scale scores across the entire spectrum of scores). Differences between interval scores are interpretable, but ratios are not. The temperature in Celsius or Fahrenheit is an example of an interval scale outcome. The difference between 30º and 40º is the same as the difference between 70º and 80º, yet 80º is not twice as warm as 40º. Ordinal outcomes can be less specific as the ordered categories need not be equally spaced. Symptom severity is an example of an ordinal outcome and it is not clear whether the difference between much worse and slightly worse is the same as the difference between no change and slightly improved. Some studies use visual scales to assess participants’ self-reported signs and symptoms. Pain is often measured in this way, from 0 to 10 with 0 representing no pain and 10 representing agonizing pain. Participants are sometimes shown a visual scale such as that shown in the upper portion of the figure below and asked to choose the number that best represents their pain state. Sometimes pain scales use visual anchors as shown in the lower portion of the figure below. In the upper portion of the figure, certainly, 10 is worse than 9, which is worse than 8; however, the difference between adjacent scores may not necessarily be the same. It is important to understand how outcomes are measured to make appropriate inferences based on statistical analysis and, in particular, not to overstate precision.

### Assigning Ranks

The nonparametric procedures that we describe here follow the same general procedure. The outcome variable (ordinal, interval, or continuous) is ranked from lowest to highest and the analysis focuses on the ranks as opposed to the measured or raw values. For example, suppose we measure self-reported pain using a visual analog scale with anchors at 0 (no pain) and 10 (agonizing pain) and record the following in a sample of n=6 participants:

7               5               9              3             0               2

The ranks, which are used to perform a nonparametric test, are assigned as follows: First, the data are ordered from smallest to largest. The lowest value is then assigned a rank of 1, the next lowest a rank of 2, and so on. The largest value is assigned a rank of n (in this example, n=6). The ordered data and corresponding ranks are shown below:

Ordered Data:       0         2         3         5         7         9

Ranks:                  1         2         3         4         5         6

A complicating issue that arises when assigning ranks occurs when there are ties in the sample (i.e., the same values are measured in two or more participants). For example, suppose that the following data are observed in our sample of n=6:

Observed Data:       7         7           9            3           0          2

The 4th and 5th ordered values are both equal to 7. When assigning ranks, the recommended procedure is to assign the mean rank of 4.5 to each (i.e., the mean of 4 and 5), as follows:

Ordered Data:       0         2         3         7         7         9

Ranks:                  1         2         3         4.5       4.5       6

Suppose that there are three values of 7. In this case, we assign a rank of 5 (the mean of 4, 5 and 6) to the 4th, 5th and 6th values, as follows:

Ranks:                  1         2         3         5         5         5

Using this approach of assigning the mean rank when there are ties ensures that the sum of the ranks is the same in each sample (for example, 1+2+3+4+5+6=21, 1+2+3+4.5+4.5+6=21, and 1+2+3+5+5+5=21). Using this approach, the sum of the ranks will always equal n(n+1)/2. When conducting nonparametric tests, it is useful to check the sum of the ranks before proceeding with the analysis.
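The ranking rule above, including the mean-rank treatment of ties, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `average_ranks` is an assumption made for this example.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of tied values starting at sorted position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j correspond to ranks i+1..j+1; assign their mean.
        mean_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


no_ties = average_ranks([7, 5, 9, 3, 0, 2])
with_ties = average_ranks([7, 7, 9, 3, 0, 2])

# The rank sum should equal n(n+1)/2 whether or not there are ties.
n = 6
print(no_ties, sum(no_ties) == n * (n + 1) / 2)
print(with_ties, sum(with_ties) == n * (n + 1) / 2)
```

The ranks are returned in the order of the original observations, so the first list reads 5, 4, 6, 3, 1, 2 for the data 7, 5, 9, 3, 0, 2.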

To conduct nonparametric tests, we again follow the five-step approach outlined in the modules on hypothesis testing.

1. Set up hypotheses and select the level of significance α. Analogous to parametric testing, the research hypothesis can be one- or two-sided (one- or two-tailed), depending on the research question of interest.
2. Select the appropriate test statistic. A test statistic is a single number that summarizes the sample information. In nonparametric tests, the observed data is converted into ranks and then the ranks are summarized into a test statistic.
3. Set up decision rule. The decision rule is a statement that tells under what circumstances to reject the null hypothesis. Note that in some nonparametric tests we reject H0 if the test statistic is large, while in others we reject H0 if the test statistic is small. We make the distinction as we describe the different tests.
4. Compute the test statistic. Here we compute the test statistic by summarizing the ranks into the test statistic identified in Step 2.
5. Conclusion. The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule.   The final conclusion is either to reject the null hypothesis (because it is very unlikely to observe the sample data if the null hypothesis is true) or not to reject the null hypothesis (because the sample data are not very unlikely if the null hypothesis is true).

Three powerful nonparametric techniques will be described with examples: the Kendall Coefficient of Concordance, the Spearman Rank Correlation Coefficient (rs), and the Kruskal-Wallis one-way analysis of variance.

### Kendall Coefficient of Concordance

Example: At a textile plant, some years ago, the primary product was denim. An important customer characteristic was “hand,” that is, how the fabric drapes and feels to the touch. Traditionally, the hand was evaluated by individuals (judges or inspectors) who had become experts over time by literally handling the fabric. The lab manager had obtained, on trial from the vendor, a “handleometer,” an instrument to objectively measure hand. She believed that the current subjective procedure for determining hand was too insensitive to change and ineffective in establishing a common customer specification. The plant manager, two department heads, and a product engineer (the plant judgment panel) were opposed to the handleometer. They said that the handleometer only measures the bending moment of fabric while they recognized multidimensional aspects of hand: stiffness, friction, drape, etc. The two measuring systems were compared using an analytic technique, first to determine whether the four panel members represented a statistically homogeneous decision-making group, and second to correlate the panel average ranking with the handleometer ranked values. Ten random samples from production were obtained. The four panel members (judges) were to independently rank the 10 samples from most to least hand, the sensory response variable, with no ties (although the expanded procedure permits ties). The null hypothesis is that the judges’ rankings are independent of each other. The Kendall statistics are calculated as follows:

• Each judge ranks the samples from 1 to 10 (rank 1 is most hand)
• Sum the ranks given to each sample by the four judges (ΣR)
• Determine the average rank sum (R̅)
• Subtract the average rank sum from the rank sum for each sample (ΣR – R̅)
• Square the rank sum differences, (ΣR – R̅)²
• Sum the squares of the rank sum differences, s = Σ(ΣR – R̅)²

R̅ = 220/10 = 22, s = 1066, K = number of judges = 4, N = number of samples = 10
Degrees of freedom = ν = N – 1 = 9
Critical chi-square = χ²(0.01, 9) = 21.67
The null hypothesis is rejected. The calculated chi-square is larger than the critical chi-square. The four judges’ rankings are not independent of each other. They constitute a homogeneous panel. This does not say that they are incorrect; only that they respond in a uniform way to this form of sensory input.
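The equation for the coefficient itself did not survive in this post, so the sketch below assumes the standard no-ties form, W = 12s / (K²(N³ – N)), with the large-sample statistic χ² = K(N – 1)W. Only s, K, and N are taken from the example above; the function names are hypothetical.

```python
def kendall_w_from_s(s, k, n):
    """Kendall's W from s (sum of squared rank-sum deviations), k judges, n samples.

    Assumes the standard formula without a tie correction.
    """
    return 12.0 * s / (k ** 2 * (n ** 3 - n))


def kendall_w(rankings):
    """rankings[j][i] is the rank judge j gave to sample i (no ties assumed)."""
    k = len(rankings)
    n = len(rankings[0])
    # Rank sum for each sample across all judges, then deviations from the mean.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_rank_sum = sum(rank_sums) / n
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return kendall_w_from_s(s, k, n)


# Summary values from the example: s = 1066, K = 4 judges, N = 10 samples.
w = kendall_w_from_s(s=1066, k=4, n=10)
chi_square = 4 * (10 - 1) * w
print(round(w, 3), round(chi_square, 2))  # 0.808 29.07
```

Under these assumptions the calculated chi-square of about 29.07 exceeds the critical value of 21.67, which agrees with the rejection of the null hypothesis stated above.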

### The Spearman Rank Correlation Coefficient (rs)

The Spearman correlation coefficient is a measure of association that requires that both variables be measured in at least an ordinal scale so that the samples or individuals to be analyzed may be ranked in two ordered series. If one of the series is continuous and one is ranked, then both series must be ranked. If both series contain continuous data from an unknown distribution, both series must be ranked.
Example: The ten rank sums from the Kendall coefficient example are ranked from largest to smallest. The rank numbers from 1 through 10 are then assigned to the ranked panel sums. For the same samples, the handleometer values are ranked and then assigned the integer values from 1 through 10. The differences between the paired ranks are squared and summed.

N = 10. If N is equal to or greater than 10, the following correlation equation can be used.

The strong correlation (0.97) between the ranked handleometer measurements and the ranked subjective sensory responses of the panel judges shows that the handleometer could replace people. The handleometer values can be obtained more quickly, with greater objectivity, and with a longer life span than individuals. The lab manager was disappointed to learn that the instrument would not be purchased, due to the objections presented before the analysis.
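As a rough sketch of the calculation described above, the snippet below squares and sums the differences between paired ranks and applies the usual no-ties formula, rs = 1 – 6Σd² / (N³ – N). The formula is assumed to be the one intended here, and the two rank lists are hypothetical illustrations, not the denim data.

```python
def spearman_rs(ranks_x, ranks_y):
    """Spearman rank correlation for two equal-length lists of untied ranks."""
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1.0 - 6.0 * sum_d_squared / (n ** 3 - n)


# Hypothetical ranks for N = 10 samples (not the values from the example).
panel_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
instrument_ranks = [2, 1, 3, 4, 5, 6, 7, 8, 10, 9]

rs = spearman_rs(panel_ranks, instrument_ranks)
print(round(rs, 3))
```

Two adjacent swaps in an otherwise identical ordering are enough to give a coefficient close to 1, which is the kind of agreement reported for the panel and the instrument.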

### Kruskal-Wallis One-Way Analysis of Variance by Ranks

This is a test of independent samples. The measurements may be continuous data, but the underlying distribution is either unknown or known to be non-normal. In either case, the data can be ranked and analyzed without the constraint of having to assume a known population distribution.
Example: Three different plants manufactured the same garment style. Variation in garment length was a customer concern. The length was measured to the nearest 1/4″. Within each plant, only four measurement increment values were obtained. This lack of measurement sensitivity indicated that ranking the data was preferred to assuming normality. The null hypothesis is that the population medians are the same, H0: M1 = M2 = … = Mk, where k is the number of plants. The following table shows data coded as deviations from a common reference value.

Original Data Measurements (Coded)

For simplicity and convenience, the coded data can be further coded as integers.

The next step is to construct a combined sample, rank the combined data while retaining plant identity, and reconstitute the three plant sample sets with ranks replacing the original data. Tied ranks are replaced by the average value of the ties.
There were seven coded values tied at 1. They would have been ranks 1 through 7, and the average of ranks 1 through 7 is 4, so all coded measurement values of 1 received the average rank of 4. In a similar fashion, the five coded values tied at 2 received the average rank of 10, the three coded values tied at 3 received the average rank of 14, and the six coded values tied at 4 received the average rank of 18.5. Reconstitute the original sample sets of coded data of plants A, B, and C with the final tied ranks. In some applications, there may be both individual ranks and tied ranks. Wherever there are tied ranks, they are to be used. Now do the following analysis for plant columns A, B, and C.

G = ∑(Rank Sum)²/n = 693.781 + 495.042 + 1486.286 = 2675.109, with N = 8 + 6 + 7 = 21
The significance statistic is H, which is approximately distributed as chi-square. Tied values are included in the calculation of H.
Let t = number of tied values in each tied set. Then T = t³ – t for that set.

Let J = ∑T = 690. Let k = number of sample sets, so DF = k – 1 = 3 – 1 = 2. Let α = 0.05.
Critical chi-square = χ²(0.05, 2) = 5.99
H is less than critical chi-square. Therefore, the null hypothesis of equality of population medians cannot be rejected.
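The equation for H is not shown above, so the following sketch assumes the standard tie-corrected Kruskal-Wallis form, H = [12G / (N(N + 1)) – 3(N + 1)] / [1 – J / (N³ – N)]. The inputs are the G terms, N, and the tie-set sizes quoted in the example; the function name is hypothetical.

```python
def kruskal_wallis_h(g_terms, n_total, tie_sizes):
    """Tie-corrected H from the (rank sum)^2 / n terms, total N, and tie-set sizes."""
    g = sum(g_terms)
    h_uncorrected = 12.0 * g / (n_total * (n_total + 1)) - 3 * (n_total + 1)
    # J is the sum of T = t^3 - t over every set of tied values.
    j = sum(t ** 3 - t for t in tie_sizes)
    tie_correction = 1.0 - j / (n_total ** 3 - n_total)
    return h_uncorrected / tie_correction, j


# Values from the example: three plants, N = 21, tie sets of 7, 5, 3, and 6 values.
h, j = kruskal_wallis_h([693.781, 495.042, 1486.286], n_total=21, tie_sizes=[7, 5, 3, 6])
critical_chi_square = 5.99  # chi-square for alpha = 0.05 and 2 degrees of freedom

print(j, round(h, 2), h < critical_chi_square)
```

With these inputs J comes out to 690, matching the text, and H is roughly 3.76, which is below 5.99 and consistent with not rejecting the null hypothesis.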

### Mann-Whitney U Test

With ordinal measurements, the Mann-Whitney U test is used to test whether two independent groups have been drawn from the same population. This is a powerful nonparametric test and is an alternative to the t-test when the normality of the population is either unknown or believed to be non-normal. Consider two populations, A and B. The null hypothesis, H0, is that A and B have the same frequency distribution with the same shape and spread (the same median). An alternative hypothesis, H1, is that A is larger than B, a directional hypothesis. We accept H1 if the probability is greater than 0.5 that a score from A is larger than a score from B. That is, if a is one observation from population A, and b is one observation from population B, then H1 is that P (a > b) > 0.5. If the evidence from the data supports H1, this implies that the bulk of population A is higher than the bulk of population B. If we wished to test if B is statistically larger than A, then H1 is P (a > b) < 0.5. For a 2-tailed test, that is, for a prediction of differences that does not state direction, H1 would be P (a > b) ≠ 0.5 (the medians are not the same).

If there are n1 observations from population A and n2 observations from population B, rank all (n1 + n2) observations in ascending order. Ties receive the average of their rank numbers. The data sets should be labeled so that n1 ≤ n2. Calculate the sum of the observation ranks for population A and designate the total as Ra, then calculate the sum of the observation ranks for population B and designate the total as Rb.

Ua = n1 n2 + 0.5 n1(n1 + 1) – Ra
Ub = n1 n2 + 0.5 n2(n2 + 1) – Rb
where Ua + Ub = n1 n2

Calculate the U statistic as the smaller of Ua and Ub. For n2≤ 20, Mann-Whitney tables are used to determine the probability, based on the U, n1, and n2 values. This probability is then used to reject or fail to reject the null hypothesis. If n2 > 20, the distribution of U rapidly approaches the normal distribution and the following apply:

Umean = µu = 0.5 n1 n2
Ustandard deviation = σu = √(n1 n2 (n1 + n2 + 1) / 12)
Z = (U – µu) / σu

Example: Consider an experimental group (E) and a control group (C) with scores as shown in the Table below. Note that n1 = 3 and n2 = 4. Does the experimental group have higher scores than the control group? H0: E and C have the same median. H1: the median of E is larger than the median of C. Accept H1 if P (e > c) > 0.5. To find U, we first rank the combined scores in ascending order, being careful to retain each score’s identity as either an E or a C.
U = minimum(Ue, Uc) = minimum(3, 9) = 3. The H0 probability for n1 = 3, n2 = 4, and U = 3 is shown in the Table below as P = 0.200. Since this is greater than the significance level (e.g., α = 0.05), we fail to reject H0 and conclude that scores for both groups have come from the same population. The probabilities in the Tables given below are one-tailed. For a two-tailed test, the values for P shown in the Table should be doubled.
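The U calculation and the small-sample probability can be sketched as follows. The scores are hypothetical values chosen so that they reproduce Ue = 3 and Uc = 9 (the table of scores is not reproduced in this post), and the exact probability is obtained by enumerating every possible assignment of ranks rather than by table lookup, assuming no ties.

```python
from itertools import combinations


def mann_whitney_u(sample_a, sample_b):
    """Return (Ua, Ub) using the rank-sum formulas; ties get the average rank."""
    combined = sorted(sample_a + sample_b)
    rank_of = {}
    for value in set(combined):
        positions = [i + 1 for i, x in enumerate(combined) if x == value]
        rank_of[value] = sum(positions) / len(positions)
    n1, n2 = len(sample_a), len(sample_b)
    ra = sum(rank_of[v] for v in sample_a)
    rb = sum(rank_of[v] for v in sample_b)
    ua = n1 * n2 + 0.5 * n1 * (n1 + 1) - ra
    ub = n1 * n2 + 0.5 * n2 * (n2 + 1) - rb
    return ua, ub


def exact_one_tailed_p(u_observed, n1, n2):
    """P(U <= u_observed) under H0, enumerating all rank subsets (no ties assumed)."""
    count = 0
    total = 0
    for subset in combinations(range(1, n1 + n2 + 1), n1):
        u = n1 * n2 + 0.5 * n1 * (n1 + 1) - sum(subset)
        total += 1
        if u <= u_observed:
            count += 1
    return count / total


# Hypothetical scores for illustration only.
experimental = [9, 11, 15]
control = [6, 8, 10, 13]

ue, uc = mann_whitney_u(experimental, control)
p_value = exact_one_tailed_p(min(ue, uc), len(experimental), len(control))
print(ue, uc, p_value)  # 3.0 9.0 0.2
```

Enumeration is only practical for small samples such as this one (35 possible assignments); for larger samples the normal approximation described above is the usual choice.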

### Wilcoxon-Mann-Whitney Rank Sum Test

The Wilcoxon-Mann-Whitney rank-sum test is similar in application to the Mann-Whitney test. The null hypothesis is that the two independent random samples are from the same distribution. The alternate hypothesis is that the two distributions are different in some way. Note that this test does not require normal distributions.
The observations or scores of the two samples (A and B) are combined, placed in increasing order, and given rank numbers. Tied values are assigned the mean of the rank numbers they would otherwise occupy. Next find the rank-sum, R, of the smaller sample. Let N equal the size of the combined samples (N = n1 + n2) and n equal the size of the smaller sample. Then calculate:     R’ = n (N + 1) – R
The rank-sum values, R and R’, are compared with critical values from the Table below, which gives critical values of the smaller rank-sum. If either R or R’ is less than the critical value, the null hypothesis is rejected. If n2 > 20, the equations from the U test given above are used for the Z calculation.
Example: Determine if the data from samples A and B have the same distribution. The null hypothesis, H0, is that the data from samples A and B have the same median. The alternate hypothesis, H1, is that the median of A is larger than the median of B. Here nA = 9, nB = 10, N = 19, R = 77, and R’ = n(N + 1) – R = (9)(20) – 77 = 103.
Let α = 0.05 for a one-tailed test. From the Table below the critical value is 69. Since R = 77 is larger than 69, we fail to reject the null hypothesis of equal medians. If H1 had been that the median of A is different from the median of B, then a two-tailed test would have been used.
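The arithmetic in this example is small enough to check directly. The sketch below simply recomputes R’ and applies the decision rule; the critical value of 69 is taken from the text, and the function name is an assumption for this illustration.

```python
def complementary_rank_sum(rank_sum, n_smaller, n_total):
    """R' = n(N + 1) - R, where n is the size of the smaller sample."""
    return n_smaller * (n_total + 1) - rank_sum


rank_sum = 77          # R, rank sum of the smaller sample (n = 9)
critical_value = 69    # one-tailed, alpha = 0.05, as quoted in the example
r_prime = complementary_rank_sum(rank_sum, n_smaller=9, n_total=19)

# Reject H0 only if the smaller of R and R' falls below the critical value.
reject_null = min(rank_sum, r_prime) < critical_value
print(r_prime, reject_null)  # 103 False
```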

Wilcoxon-Mann-Whitney Critical values

### Levene’s Test

Levene’s test is used to test the null hypothesis that multiple population variances (corresponding to multiple samples) are equal; that is, it determines whether a set of k samples have equal variances. Equal variances across samples are called homogeneity of variances. Some statistical tests, e.g., the analysis of variance, assume that variances are equal across groups or samples. The Levene test can be used to verify that assumption. Levene’s test is an alternative to the Bartlett test. The Levene test is less sensitive than the Bartlett test to departures from normality. If there is strong evidence that the data do in fact come from a normal, or approximately normal, distribution, then Bartlett’s test has better performance. The well-known F test for the ratio between two sample variances assumes the data are normally distributed. Levene’s variance test is more robust against departures from normality. When there are just two sets of data, the Levene procedure is to:

1. Determine the mean of each data set
2. Calculate the deviation of each observation from the mean of its own data set
3. Let Z equal the square of the deviation from the mean
4. Apply the t test of two means to the Z data

The methodology for this calculation is very similar to that presented earlier for the two-mean, equal-variance t-test. The sample sizes do not need to be equal for Levene’s test to apply.
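The four steps can be sketched as below, using squared deviations from each sample mean as described here and a pooled-variance two-sample t statistic. The data are hypothetical, and the function name is an assumption; other variants of Levene’s test use absolute deviations instead of squares, which this sketch does not cover.

```python
from math import sqrt


def levene_two_sample_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for the two-sample Levene procedure."""

    def squared_deviations(sample):
        mean = sum(sample) / len(sample)
        return [(x - mean) ** 2 for x in sample]

    def mean_and_variance(values):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        return mean, variance

    # Steps 1 to 3: Z is the squared deviation of each observation from its own mean.
    z_a = squared_deviations(sample_a)
    z_b = squared_deviations(sample_b)
    n_a, n_b = len(z_a), len(z_b)
    mean_a, var_a = mean_and_variance(z_a)
    mean_b, var_b = mean_and_variance(z_b)

    # Step 4: equal-variance (pooled) t test on the Z values.
    pooled_variance = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t = (mean_a - mean_b) / sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2


# Hypothetical samples: same mean, visibly different spread.
t_statistic, degrees_of_freedom = levene_two_sample_t([1, 2, 3, 4, 5], [2, 3, 3, 3, 4])
print(round(t_statistic, 3), degrees_of_freedom)
```

The resulting t would then be compared with a critical t value for the given degrees of freedom, exactly as in an ordinary two-sample t-test.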

### Mood’s Median Test

Mood’s Median Test performs a hypothesis test of the equality of population medians in a one-way design. The test is robust against outliers and errors in data and is particularly appropriate in the preliminary stages of analysis. The median test determines whether k independent groups (equal size is not required) have either been drawn from the same population or from populations with equal medians. The first step is to find the combined median for all scores in the k groups. Next, replace each score by a plus if the score is larger than the combined median and by a minus if it is smaller than the combined median. If any scores fall at the combined median, the scores may still be split into plus and minus groups by assigning a plus to those scores that exceed the combined median and a minus to those that fall at or below it. Next set up a chi-square “k x 2” table with the frequencies of pluses and minuses in each of the k groups.

Table I shows the counts of critical defects that occurred in 52 lots from six different styles. Table II identifies and counts those scores above the combined median. The combined median is determined by pooling all of the Table I values and determining the middle value. In this case, the ordered 26th value is 3 and the 27th value is 4, so the median of all of the values is 3.5.

The (+O) is the observed number of values greater than the median. The (–O) is the observed number of values less than the median. The expected frequency (E) for each style, for the number of lots above or below the median, is one-half of the number of lots in that style, or N/2.
There are 26 scores (+) above the combined median and 26 scores (–, not shown) below the combined median. To apply the chi-square test, the Table below shows the chi-square k x 2 table, where (O) represents the observed frequencies and (E) represents the expected frequencies. Because cell expected frequencies (E) should not be less than 4 (preferably 5), the results of styles K and L are combined. The null hypothesis, H0, states that all style medians are equal. The alternative hypothesis, H1, states that at least one style median is different. The chi-square calculation over all ten cells is represented by:
χ² = Σ (O – E)² / E
The degrees of freedom for contingency tables is:
df = (rows – 1) x (columns – 1) = (2 – 1) x (5 – 1) = 4
Assume we want a level of significance (alpha) of 0.05. The critical chi-square:
χ²(0.05, 4) = 9.49
Since the calculated χ² is less than the critical χ², the null hypothesis cannot be rejected, at a 0.05 level of significance (or a 95% confidence level).
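A minimal sketch of the procedure, from combined median to chi-square, is shown below. The three groups are hypothetical data rather than the defect counts from Table I, and expected frequencies are computed in the general contingency-table way (row total × column total / grand total), which reduces to half of each group’s size when the pluses and minuses split evenly, as they do in the example above.

```python
from statistics import median


def moods_median_chi_square(groups):
    """Return (chi-square, degrees of freedom) for Mood's median test."""
    pooled = [x for group in groups for x in group]
    combined_median = median(pooled)

    # Plus: score exceeds the combined median. Minus: score is at or below it.
    above = [sum(1 for x in group if x > combined_median) for group in groups]
    below = [len(group) - a for group, a in zip(groups, above)]

    total = len(pooled)
    total_above = sum(above)
    total_below = sum(below)

    chi_square = 0.0
    for group, observed_above, observed_below in zip(groups, above, below):
        expected_above = len(group) * total_above / total
        expected_below = len(group) * total_below / total
        chi_square += (observed_above - expected_above) ** 2 / expected_above
        chi_square += (observed_below - expected_below) ** 2 / expected_below

    degrees_of_freedom = (len(groups) - 1) * (2 - 1)
    return chi_square, degrees_of_freedom


# Hypothetical data for three groups (not the lot data from the tables).
groups = [[1, 2, 3, 8, 9], [4, 5, 6, 7, 10], [11, 12, 13, 14, 15]]
chi_square, degrees_of_freedom = moods_median_chi_square(groups)
print(round(chi_square, 3), degrees_of_freedom)
```

This sketch does not pool sparse groups; with real data, groups whose expected frequencies fall below about 4 or 5 would be combined first, as was done for styles K and L.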

### Nonparametric Test Summary

The following nonparametric tests are analogous to the parametric t-tests and analysis of variance procedures in that they are used to perform tests about population location or center value. The center value is the mean for parametric tests and the median for nonparametric tests.

1. One-sample sign performs a test of the median and calculates the corresponding point estimate and confidence interval. Use this test as a nonparametric alternative to the one-sample Z and one-sample t-tests.
2. One-sample Wilcoxon performs a signed-rank test of the median and calculates the corresponding point estimate and confidence interval. Use this test as a nonparametric alternative to the one-sample Z and one-sample t-tests.
3. Mann-Whitney performs a hypothesis test of the equality of two population medians and calculates the corresponding point estimate and confidence interval. Use this test as a nonparametric alternative to the two-sample t-test.
4. Kruskal-Wallis performs a hypothesis test of the equality of population medians for a one-way design (two or more populations). This test is a generalization of the procedure used by the Mann-Whitney test and, like Mood’s median test, offers a nonparametric alternative to the one-way analysis of variance. The Kruskal-Wallis test looks for differences among the population medians.
5. Mood’s median test performs a hypothesis test of the equality of population medians in a one-way design. Mood’s median test, like the Kruskal-Wallis test, provides a nonparametric alternative to the usual one-way analysis of variance. Mood’s median test is sometimes called a median test or sign scores test.

The Kruskal-Wallis test is more powerful (the confidence interval is narrower, on average) than Mood’s median test for analyzing data from many distributions, including data from the normal distribution, but is less robust against outliers.

Comparison Summary of Nonparametric Tests

It should be noted that nonparametric tests are less powerful (they require more data to find the same size difference) than the equivalent t-tests or ANOVA tests. In general, nonparametric procedures are used either when parametric assumptions cannot be met, or when the nature of the data requires a nonparametric test.