In univariate statistics, there are one or more independent variables (X1, X2, ...) and only one dependent variable (Y). Multivariate analysis considers two or more dependent variables (Y1, Y2, ...) simultaneously against multiple independent variables (X1, X2, ...). The manual effort required to solve multivariate problems was long an obstacle to their use. Recent advances in computer software and hardware have made it practical to solve more problems using multivariate analysis. Software programs available for multivariate problems include SPSS, S-Plus, SAS, and Minitab. This coverage of multivariate analysis is only an introduction to the subject. For more in-depth information, the reader is advised to consult other references.
Multivariate analysis has found wide usage in the social sciences, psychology, and educational fields. Applications for multivariate analysis can also be found in the engineering, technology, and scientific disciplines. This element will highlight the following multivariate concepts or techniques:
- Multi-Vari Studies
- Principal components analysis
- Factor analysis
- Discriminant function analysis
- Cluster analysis
- Canonical correlation analysis
- Multivariate analysis of variance
Multi-Vari Studies
Multi-Vari charts are practical graphical tools that illustrate how variation in the input variables (x’s) impacts the output variable (y) or response. These charts can help screen for possible sources of variation (x’s). There are two types of Multi-Vari studies: 1) passive nested studies, which are conducted without disrupting the routine of the process, and 2) manipulated crossed studies, which are conducted by intentionally manipulating levels of the x’s. Sources of variation can be controllable variables, noise variables, or both. Categorical x’s are very typical for Multi-Vari studies (e.g., short vs. long, low vs. high, batch A vs. batch B vs. batch C). Multi-Vari studies help the organization determine where to focus its efforts. Given either historic data or data collected from a constructed sampling plan, a Multi-Vari study is a visual comparison of the effects of each of the factors, displaying the means at each factor level for all factors. It is an efficient graphical tool for reducing the number of candidate factors that may be impacting a response (y) down to a practical number.
In statistical process control, one tracks variables like pressure, temperature, or pH by taking measurements at certain intervals. The underlying assumption is that the variables will have approximately one representative value when measured. Frequently, this is not the case. The temperature in the cross-section of a furnace will vary and the thickness of a part may also vary depending on where each measurement is taken. Often the variation is within the piece and the source of this variation is different from piece-to-piece and time-to-time variation. The multi-vari chart is a very useful tool for analyzing all three types of variation. Multi-Vari charts are used to investigate the stability or consistency of a process. The chart consists of a series of vertical lines, or other appropriate schematics, along a time scale. The length of each line or schematic shape represents the range of values found in each sample set. Variation within samples (five locations across the width) is shown by the line length. Variation from sample to sample is shown by the vertical positions of the lines.
To establish a multi-vari chart, a sample set is taken and plotted from the highest to lowest value. This variation may be represented by a vertical line or other rational schematics. The figure below shows an injection-molded plastic part. The thickness is measured at four points across the width as indicated by arrows.
Three hypothetical cases are presented to help understand the interpretation of multi-vari charts.
Interpretation of the chart is apparent once the values are plotted.
The advantages of multi-vari charts are:
- It can dramatize the variation within the piece (positional).
- It can dramatize the variation from piece-to-piece (cyclical).
- It helps to track any time-related changes (temporal).
- It helps minimize variation by identifying areas to look for excessive variation. It also identifies areas not to look for excessive variation.
The table below identifies the typical areas of time and locational variation.
Note, positional variation can often be broken into multiple components:
Nested Designs
Sources of variation for a passive nested design might be:
- Positional (i.e., within-piece variation).
- Cyclical (i.e., consecutive piece-to-piece variation).
- Temporal (time-to-time variation, e.g., shift-to-shift or day-to-day).
The y-axis in this figure records the measure of performance of units taken at different periods of time, in a time-order sequence. Each cluster (shaded box) represents three consecutive parts, each measured in three locations. Each of the three charts represents a different process, with each process having the greatest source of variation coming from a different component. In the Positional Chart, each vertical line represents a part with the three dots recording three measurements taken on that part. The greatest variation is within the parts. In the Cyclical Chart, each cluster represents three consecutive parts. Here, the greatest variation is shown to be between consecutive parts. The third chart, the Temporal Chart, shows three clusters representing three different shifts or days, with the largest variation between the clusters.
Nested Multi-Vari Example:
In a nested Multi-Vari study, the positional readings taken were nested within a part. The positions within part were taken at random and were unique to that part; position 1 on part 1 was not the same as position 1 on part 2. The subgroups of three “consecutive parts” were nested within a shift or day. The parts inspected were unique to that shift or day. A sampling plan or hierarchy was created to define the parameters in obtaining samples for the study.
A passive nested study was conducted in which two consecutive parts (cyclical) were measured over three days (temporal). Each part was measured in three locations, which were randomly chosen on each part (positional). A nested Multi-Vari chart was then created to show the results.
The day-to-day variation appears to be the greatest source of variation, compared to the variation within part or part-to-part within a day (consecutive parts). The next step in this study would be to evaluate the process parameters that impact day-to-day variation, i.e., what changes (different material lots/batches, environmental factors, etc.) occur from day to day and affect the process.
Crossed Designs:
Sources of variation for a manipulated crossed design might be:
- Machine (A or B).
- Tool (standard or carbide).
- Coolant (off or on).
Interactions can only be observed with crossed studies. When an interaction occurs, the factors associated with the interaction must be analyzed together to see the effect of one factor’s settings on the other factor’s settings. With fully crossed designs, the data may be reordered and a chart may be generated with the variables in different positions to clarify the analysis. In contrast, passive nested designs are time-based analyses and therefore must maintain the data sequence in the Multi-Vari chart.
Crossed Design Example:
A sampling plan or hierarchy for a crossed design is shown below:
The coolant was turned “on” or “off” for each of two tools while the tools were being used on one of two machines. Every possible combination was run using the same two machines, the same two types of tools, and the same two coolant settings. The following chart uses these sources to investigate graphically the main effects and interactions of these factors in improving surface finish (lower is better).
It appears that the best (lowest) value occurs with carbide tools using no coolant. The different machines have a relatively small impact. It may also be noted that when the coolant is off, there is a large difference between the two tool types. Because the study is fully crossed and the effect of tool type depends on the coolant setting, we would conclude that there is an interaction between coolant and tool type. The interaction is also apparent in the second chart, which shows the same data sorted differently. Coolant “off” with the carbide tool is again the lowest combination. Notice, however, that coolant “on” is now the lowest combination for the standard tool. Hence, an interaction would be expected here as well.
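The reading of the crossed charts above can be mirrored numerically. The sketch below is only an illustration: the surface-finish values are hypothetical numbers chosen to resemble the pattern described in the text (they are not the article's data), and the helper names `level_mean`, `main_effect`, and `cell_mean` are made up for this example. It compares the main effect of each factor with the tool x coolant interaction, taken here as the difference between the coolant effect for each tool.

```python
from statistics import mean

# Hypothetical surface-finish readings (lower is better), one per
# machine/tool/coolant combination of the fully crossed 2 x 2 x 2 plan.
finish = {
    ("A", "standard", "off"): 40, ("A", "standard", "on"): 28,
    ("A", "carbide", "off"): 16, ("A", "carbide", "on"): 30,
    ("B", "standard", "off"): 42, ("B", "standard", "on"): 30,
    ("B", "carbide", "off"): 18, ("B", "carbide", "on"): 32,
}

def level_mean(factor_index, level):
    """Average response over all runs where one factor is at a given level."""
    return mean(v for k, v in finish.items() if k[factor_index] == level)

def main_effect(factor_index, low, high):
    """Change in the average response when a factor moves from low to high."""
    return level_mean(factor_index, high) - level_mean(factor_index, low)

def cell_mean(tool, coolant):
    """Average response for one tool/coolant combination (over both machines)."""
    return mean(v for (_, t, c), v in finish.items() if t == tool and c == coolant)

machine_effect = main_effect(0, "A", "B")
tool_effect = main_effect(1, "standard", "carbide")
coolant_effect = main_effect(2, "off", "on")

# Interaction: the coolant effect is computed separately for each tool.
# If the two effects differ, the factors cannot be interpreted on their own.
coolant_effect_standard = cell_mean("standard", "on") - cell_mean("standard", "off")
coolant_effect_carbide = cell_mean("carbide", "on") - cell_mean("carbide", "off")
interaction = coolant_effect_carbide - coolant_effect_standard

best = min(finish, key=finish.get)
print(machine_effect, tool_effect, coolant_effect, interaction, best)
```

With these made-up numbers the coolant main effect is close to zero even though coolant matters a great deal for each tool individually, which is why interacting factors have to be analyzed together.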
Steps to create a Multi-Vari chart:
Multi-Vari charts are easiest done with a computer, but not difficult to do by hand.
- Plan the Multi-Vari Study.
- Identify the Y to be studied.
- Determine how it will be measured and validate the measurement system.
- Identify the potential sources of variation. For nested designs, the levels depend on passive data; for crossed designs, the levels are specifically selected for manipulation.
- Create a balanced sampling plan or hierarchy of sources. Balance refers to equal numbers of samples within the upper levels in the hierarchy (i.e., two tools for each machine). A strict balance of exactly the same number of samples for each possible combination of factors, while desirable, is not an absolute requirement. However, there must be at least one data point for each possible combination.
- Decide how to collect data in order to distinguish between the major sources of variation.
- When doing a nested study, the order of the sampling plan should be maintained to preserve the hierarchy.
- Take data in the order of production (not randomly).
- Continue to collect data until 80% of the typical range of the response variable is observed (low to high). (This range may be estimated from historical data.)
- For fully crossed designs, a Multi-Vari study can be used to graphically look at interactions with factors that are not time-dependent (in which case, runs can be randomized as in a design of experiments).
- Take a representative sample.
It is suggested that a minimum of three samples per lowest level subgroup be taken.
- Plot the data.
- The y-axis will represent the scaled response variable.
- Plot the positional component on a vertical line from low to high and plot the mean for each line (each piece). (Offsetting the bar at a slight angle from vertical can improve clarity.)
- Repeat for each positional component on neighboring bars.
- Connect the positional means of each bar to evaluate the cyclical component.
- Plot the mean of all values for each cyclic group.
- Connect cyclical means to evaluate the temporal component.
- Compare components of variation for each component (largest change in y (Δy) for each component).
- Many computer programs will not produce charts unless the designs are balanced or have at least one data point for each combination.
- Each plotted point represents an average of the factor combination selected. When a different order of factors is selected, the data, while still the same, will be re-sorted. Remember, if the study is nested, the order of the hierarchy must be maintained from the top-down or bottom-up of the sampling plan.
- Analyze the results.
Ask: Is there an area that shows the greatest source of variation? Are there cyclic or unexpected nonrandom patterns of variation? Are the nonrandom patterns restricted to a single sample or more? Are there areas of variation that can be eliminated (e.g., shift-to-shift variation)?
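The plotting and comparison steps above boil down to a few means and ranges, which can be sketched without any charting software. The example below is a minimal illustration for a nested study. The readings are hypothetical, and the summary chosen for each component (largest within-part range, largest range of part means within a day, range of day means) is one reasonable way to express the largest Δy, not a prescribed formula.

```python
from statistics import mean

# Hypothetical nested data: day -> consecutive parts -> three positional readings.
data = {
    "day1": [[10.1, 10.3, 10.2], [10.2, 10.4, 10.3]],
    "day2": [[11.0, 11.2, 11.1], [11.1, 11.3, 11.2]],
    "day3": [[9.4, 9.6, 9.5], [9.5, 9.7, 9.6]],
}

def spread(values):
    """Range (high minus low) of a collection of values."""
    values = list(values)
    return max(values) - min(values)

def multi_vari_components(nested):
    """Largest change in y attributable to each nested source of variation."""
    # Positional: length of the longest vertical line (within one part).
    positional = max(spread(part) for parts in nested.values() for part in parts)
    # Cyclical: largest spread of part means within one day.
    cyclical = max(spread(mean(part) for part in parts) for parts in nested.values())
    # Temporal: spread of the day means (the data are balanced, so the mean of
    # part means equals the mean of all readings in the day).
    temporal = spread(mean(mean(part) for part in parts) for parts in nested.values())
    return {"positional": positional, "cyclical": cyclical, "temporal": temporal}

components = multi_vari_components(data)
largest = max(components, key=components.get)
print(components, "largest source:", largest)
```

For these made-up readings the temporal component dominates, so the follow-up questions would focus on what changes from day to day.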
Example:
Several ribbons, one-half short and one-half long and in four colours (red, white, blue, and yellow), are studied. Three samples of each combination are taken, for a total of twenty-four data points (2 x 4 x 3). Ribbons are nested within the “length”: ribbon one is unique to “short” and ribbon four is unique to “long.” Length, however, is crossed with colour: “short” is not unique to “blue.” Length is repeated for all colours. (This example is a combination study, nested and crossed, as are many Gauge R&Rs.)
The following data set was collected. Note that there are three ribbons for each combination of length and colour as identified in the “Ribbon #” column.
The ribbons are sorted by length, then colour to get one chart.
- Each observation is shown in coded circles.
- The squares are averages within a given length and colour.
- Each large diamond is the average of six ribbons of both lengths within a colour.
- Note the obvious pattern of the first, second, and third measured ribbons within the subgroups. The short ribbons (length = 1) consistently measure low, while the long ribbons consistently measure high, and the difference between short and long ribbons (Δy) is consistent.
- There is more variation between colours than lengths (Δy is greater between colours than between lengths).
- Also note the graph indicates that while the value of a ribbon is based upon both its colour and length, longer (length = 2) ribbons are in general more valuable than short ribbons. However, a short red ribbon has a higher value than a long yellow one. Caution should be taken here because not much about how the individual values vary relative to this chart is known. Other tools (e.g., hypothesis tests and DOEs) are needed for that type of analysis.
Multi-Vari Case Study
A manufacturer produced flat sheets of aluminium on a hot rolling mill. Although a finish trimming operation followed, the basic aluminium plate thickness was established during the rolling operation. The thickness specification was 0.245″ ± 0.005″. The operation had been producing scrap. A process capability study indicated that the process spread was 0.0125″ (a C_{p} of 0.8) versus the requirement of 0.010″. The operation generated a profit of approximately Rs 20,00,000 per month even after a scrap loss of Rs 2,00,000 per month. Refitting the mill with a more modern design, featuring automatic gauge control and hydraulic roll bending, would cost Rs 80,00,000 and result in 6 weeks of downtime for installation. The department manager requested that a quality engineer conduct a multi-vari study before further consideration of the new mill design or other alternatives. Four positional measurements were made at the corners of each flat sheet to determine within-piece variation. Three flat sheets were measured in consecutive order to determine piece-to-piece variation. Additionally, samples were collected each hour to determine temporal variation. The pictorial results are as follows. The maximum detected variation was 0.010″. Without sophisticated analysis, it appeared that the time-to-time variation was the largest culprit. A gross change was noted after the 10:00 AM break. During this time, the roll coolant tank was refilled.
Actions taken over the next two weeks included re-leveling the bottom back-up roll (approximately 30% of total variation) and initiating more frequent coolant tank additions, followed by an automatic coolant make-up modification (50% of total variation). Additional spray nozzles were added to the roll stripper housings to reduce heat build-up in the work rolls during the rolling process (10%-15% of total variation). The piece-to-piece variation was ignored. This dimensional variation may have resulted from roll bearing play or variation in incoming aluminium sheet temperature (or a number of other sources). The results from this single study indicated that, if all of the modifications were perfect, the resulting measurement spread would be 0.002″ total. In reality, the end result was ± 0.002″, or 0.004″ total, under conditions similar to those of the initial study. The total cash expenditure was Rs 80,000 for the described modifications. All work was completed in two weeks. The specification of 0.245″ ± 0.005″ was easily met.
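The capability figures in the case study can be checked with a short calculation. The sketch below assumes the usual definition of C_{p} as specification width divided by process spread. The function name `process_capability` is illustrative, and the "after" value uses the 0.004″ total spread quoted for the end result.

```python
def process_capability(usl, lsl, process_spread):
    """Cp = specification width / process spread."""
    return (usl - lsl) / process_spread

nominal, tolerance = 0.245, 0.005           # specification: 0.245" +/- 0.005"
usl, lsl = nominal + tolerance, nominal - tolerance

cp_before = process_capability(usl, lsl, 0.0125)  # spread from the capability study
cp_after = process_capability(usl, lsl, 0.004)    # spread after the modifications

print(f"Cp before: {cp_before:.2f}, Cp after: {cp_after:.2f}")
```

A C_{p} below 1 means the process spread is wider than the tolerance band, which is consistent with the scrap reported before the study.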
Principal Components Analysis
Principal components analysis (PCA) and factor analysis (FA) are two related techniques used to find patterns of correlation among many possible variables or subsets of data and to reduce them to a smaller manageable number of components or factors. The researcher attempts to find the primary components, or factors, that account for most of the sources of variance. PCA refers to subsets as components and FA uses the term factors. Grimm states that a minimum of 100 observations should be used for PCA. The ratio is usually set at approximately 5 observations per variable. If there are 25 variables, then the ratio of 5:1 requires 5 observations/variable x 25 variables = 125 observations.
For illustration purposes, five independent variables will be considered in the growth of communities. The investigator wants to know how many of these components really contribute to growth: one, two, three, or all? Perhaps two principal components will explain 95% of the variance, and the other three will contribute only 5%. At one time, multivariate analysis required familiarity with linear algebra and matrices. To reduce this manual effort, a statistical software package such as Minitab, SPSS, or S-Plus can be used. Minitab is used in this discussion to display variances and the correlation matrix. Higher correlation values indicate a stronger linkage between factors. An example of PCA is presented in the table below. In this example, an investigator wishes to uncover the principal factors that are important for a community desiring high-tech growth. If only a few principal factors account for the vast majority of the variance in growth, then communities can focus on these vital few. The independent factors are:
- High tech workers (thousands of workers)
- Entrepreneurial culture (number of startups per year)
- University-industry interactions (measured by projects per year)
- Creative classes (percentage of professionals and knowledge workers)
- Amount of venture capital (in millions of dollars)
The table below shows hypothetical data generated from interviews with community leaders.
For illustration purposes, the Minitab statistical recap of the information is shown in the table below. It provides the correlation matrix. A step-by-step analysis of the Minitab results is as follows:
- A correlation matrix is used to determine the relationship between components.
- Matrices define quantities as eigenvalues and eigenvectors. This is an eigenvalue analysis.
- The eigenvalues are summed and a proportion is calculated. The sum of eigenvalues is 4.9999 (it would be 5.0, the number of variables, apart from rounding error). Thus, 3.5856 divided by 4.9999 is 0.717. PC1 contains 71.7% of the variance.
- PC1 and PC2 explain 89.2% of the variance. This may be sufficient for the researcher.
- There are five total components. Pareto analysis indicates two principal components.
- The first PC indicates that there is no clear separation for four components (high tech workers, entrepreneurial culture, university-industry projects, and venture capital). It is up to the researcher to further distinguish this grouping. It could be more related to the need for a critical mass of necessary resources. The second PC indicates that “creative class” is the prime component. A closer look at the first principal component may be required since the values are negative. (This was a small illustrative sample.)
- A “scree” plot (similar to a Pareto line chart) is provided by Minitab software to display the “vital few” eigenvalues.
Finally, an equation can be generated for the two principal components via the use of the coefficients.
PC1 = -0.449 (hightec) - 0.507 (entre) - 0.512 (university) - 0.226 (creative) - 0.478 (venture)
PC2 = 0.154 (hightec) + 0.189 (entre) + 0.80 (university) - 0.966 (creative) - 0.025 (venture)
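The eigenvalue steps above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the community scores are hypothetical (they are not the article's table), and power iteration is used only because it is short enough to show in full, whereas statistical packages use more robust eigenvalue routines. The sketch builds a correlation matrix, finds the largest eigenvalue and its coefficients (the first principal component), and divides by the number of variables, since the eigenvalues of a correlation matrix sum to the number of variables.

```python
import math
from statistics import mean

# Hypothetical scores for eight communities (illustrative only).
columns = {
    "hightec":    [2, 4, 5, 7, 9, 12, 15, 20],
    "entre":      [1, 3, 4, 6, 7, 10, 13, 18],
    "university": [3, 3, 6, 8, 8, 11, 16, 19],
    "creative":   [30, 22, 35, 25, 38, 28, 33, 36],
    "venture":    [5, 8, 12, 15, 22, 30, 41, 60],
}

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(cols):
    series = list(cols.values())
    return [[correlation(a, b) for b in series] for a in series]

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def first_component(m, iterations=500):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    n = len(m)
    v = [1 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

R = correlation_matrix(columns)
eigenvalue, pc1 = first_component(R)
proportion = eigenvalue / len(R)  # trace of a correlation matrix = number of variables

print(f"PC1 eigenvalue {eigenvalue:.4f}, proportion of variance {proportion:.3f}")
for name, coefficient in zip(columns, pc1):
    print(f"  {name}: {coefficient:+.3f}")
```

The sign of an eigenvector is arbitrary: multiplying every coefficient by -1 gives an equally valid component, so a set of all-negative coefficients carries the same information as the same set with all signs flipped.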
Factor Analysis
Factor analysis is a data reduction technique to identify factors that explain variation. It is very similar to the principal components analysis technique. That is, factor analysis attempts to simplify complex sets of data, reducing many factors to a smaller set. However, there is some subjective judgment involved in describing the factors in this method of analysis. The output variables are linearly related to the input factors. The variables under investigation should be measurable, have a range of measurements, and be symmetrically distributed. There should be four or more input factors for each dependent variable. Factor analysis proceeds in two stages: factor extraction and factor rotation. The first stage distinguishes the major factors for further study (extraction). The second stage rotates the factors to make them more meaningful. A principal components analysis can be performed on the data to provide a reduction in the number of factors. (Minitab can also examine the data through a “maximum likelihood” method.)
The economic development data from the previous example was run through a principal components analysis, which indicated that two factors were significant. From this information, a researcher can go back into Minitab, perform a factor analysis for two factors, and obtain a correlation matrix. To make sense of the information, note that Factor 1 has four variables in a grouping (entrepreneurial culture, university, high tech, and venture) and Factor 2 has the creative class as the prime variable. This is a similar result to the earlier principal components analysis. Again, the first factor has negative readings, so the researcher should examine that grouping more closely for meaning. The communality column indicates how well the chosen factors explain the variability of each variable. The communality numbers are very high.
This means that the researcher can state that the two major factors in high technology community development would involve the five studied variables. The data and factors can be rotated (by the software) to view the data from a different perspective. The four rotational methods in Minitab are equimax, varimax, quartimax, and orthomax. Other software has other varieties.
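Communality and rotation can be illustrated with a small sketch. The loadings below are hypothetical values, not the Minitab output, and the fixed 25-degree angle merely stands in for whatever angle a rotation criterion such as varimax would select. The point it demonstrates is that an orthogonal rotation redistributes the loadings between the factors but leaves each variable's communality (the sum of its squared loadings) unchanged.

```python
import math

# Hypothetical unrotated loadings on two extracted factors (illustrative only).
loadings = {
    "hightec":    (0.85, 0.14),
    "entre":      (0.96, 0.18),
    "university": (0.97, 0.07),
    "creative":   (0.43, -0.90),
    "venture":    (0.90, -0.02),
}

def communality(row):
    """Share of a variable's variance explained by the retained factors."""
    return sum(value ** 2 for value in row)

def rotate(row, degrees):
    """Orthogonal rotation of a two-factor loading pair by a given angle."""
    f1, f2 = row
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return (c * f1 + s * f2, -s * f1 + c * f2)

rotated = {name: rotate(row, 25) for name, row in loadings.items()}
before = {name: communality(row) for name, row in loadings.items()}
after = {name: communality(row) for name, row in rotated.items()}

for name in loadings:
    print(f"{name}: communality {before[name]:.3f} -> {after[name]:.3f}")
```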
Discriminant Analysis
If one has a sample with known groups, discriminant analysis can be used to classify the observations or attributes into two or more groups. Discriminant analysis can be used as either a predictive or a descriptive tool. The decisions could involve medical care, college success attributes, car loan creditworthiness, or the previous economic development issues. Discriminant analysis can be used as a follow-up to MANOVA. The possible number of linear combinations (discriminant functions) for a study is the smaller of the number of groups minus one and the number of variables. Some assumptions in discriminant analysis are: the variables follow a multivariate normal distribution, the population variances and covariances among the dependent variables are the same, and the samples within the variables are randomly obtained and exhibit independence of scores from the other samples. Minitab provides two forms of analysis: linear and quadratic discriminant analysis. The linear discriminant analysis assumes that all groups have the same covariance matrix. The quadratic discriminant analysis does not make this assumption. In the linear discriminant analysis, the Mahalanobis distance is the measure used to form or classify groups. The Mahalanobis distance is the squared distance (linear measure) from the observation to the group center. Observations are classified into groups according to this distance measure. In the quadratic discriminant analysis, the squared distance does not translate to a linear function, but to a quadratic function. The quadratic distance is called the generalized squared distance.
The previous example, which provided information on high technology growth, will be used for a discriminant analysis example. An additional column has been inserted. It is a column used to state that the area is a “new economy” community. For example, a “yes” or “no” will be used to indicate if a community is considered a “new economy” area. The discriminant analysis will correlate the data and verify if the decision was correct.
The Minitab analysis states that the decisions on the grouping were 10 out of 10 (100% correct). That is, the values in the various factors match up enough to place various regions in certain categories.
Discriminant Analysis: New economy versus creative class, entrepreneurial culture, university-industry projects, and venture capital
Linear Method for Response: New Economy
Predictors: creative, entrepre, universi, high tech, venture
Summary of Classification: N = 10, N Correct = 10, Proportion Correct = 1.000 (100%)
Squared distance between groups (also called the Mahalanobis distance):
Group | no | yes |
no | 0.00000 | 9.17913 |
yes | 9.17913 | 0.00000 |
Linear Discriminant Function for Group:
The above results are Minitab outputs with few adjustments.
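The squared-distance classification can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses two hypothetical predictors instead of five, made-up observations, a pooled within-group covariance matrix (the equal-covariance assumption of the linear method), and equal prior probabilities, so each observation is simply assigned to the group whose center is nearest in squared Mahalanobis distance. The function names are illustrative.

```python
from statistics import mean

# Hypothetical two-variable observations for each known group.
groups = {
    "no":  [(2, 1), (3, 2), (4, 2), (3, 3)],
    "yes": [(8, 7), (9, 9), (10, 8), (9, 8)],
}

def center(rows):
    return (mean(r[0] for r in rows), mean(r[1] for r in rows))

def pooled_covariance(all_groups):
    """Pooled within-group 2 x 2 covariance matrix."""
    sxx = sxy = syy = 0.0
    n_total = 0
    for rows in all_groups:
        mx, my = center(rows)
        sxx += sum((r[0] - mx) ** 2 for r in rows)
        sxy += sum((r[0] - mx) * (r[1] - my) for r in rows)
        syy += sum((r[1] - my) ** 2 for r in rows)
        n_total += len(rows)
    dof = n_total - len(all_groups)
    return [[sxx / dof, sxy / dof], [sxy / dof, syy / dof]]

def mahalanobis_sq(point, mid, cov):
    """Squared Mahalanobis distance, using the closed-form 2 x 2 inverse."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    det = a * d - b * b
    dx, dy = point[0] - mid[0], point[1] - mid[1]
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

cov = pooled_covariance(list(groups.values()))
centers = {name: center(rows) for name, rows in groups.items()}

def classify(point):
    return min(centers, key=lambda name: mahalanobis_sq(point, centers[name], cov))

between = mahalanobis_sq(centers["no"], centers["yes"], cov)
n_correct = sum(classify(p) == name for name, rows in groups.items() for p in rows)
n_total = sum(len(rows) for rows in groups.values())
print(f"squared distance between groups: {between:.5f}")
print(f"N = {n_total}, N correct = {n_correct}")
```

The printed summary has the same shape as the classification recap above: a count of correctly classified observations and the squared distance between the two group centers.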
Cluster Analysis
Cluster analysis is used to determine groupings or classifications for a set of data. A variety of rules or algorithms have been developed to assist in group formations. The natural groupings should have observations classified so that similar types are placed together. A file on attributes of high achieving students could be grouped or classified by IQ, parental support, school system, study habits, and available resources. Cluster analysis is used as a data reduction method in an attempt to make sense of large amounts of data from surveys, questionnaires, polls, test questions, scores, etc.
The economic development example in the previous discussion will again be used to validate groupings. The two types of groups will be the new economy and not the new economy. The graphic output from the analysis is the classification tree or dendrogram. It is a line graph linking variables and groups at various stages.
The data in the table will be analyzed by the cluster analysis method. Using Minitab, the first analysis request calls for two groups. (More groups can be used.) It is displayed below. The analysis shows the requested two groupings. However, instead of grouping into the presumed two groups of new economy or not the new economy, the program used an algorithm based on measures of “closeness” between groups. Since the author requested two groups, the final iteration provides two groups. The dendrogram in the figure below shows that San Jose is distinctive and of a higher ranking than the other communities. The dendrogram also shows that Austin and Seattle are distinct from the other, lower communities. Communities 7, 8, and 9 form the lowest cluster. This result can be verified by rerunning the analysis and requesting four groupings. Another interesting analysis would be to group the data by the original five factors as shown above. The figure indicates that creative class is separated from the other factors in the grouping. In the principal components and factor analysis discussion, creative class was always the major factor listed in the second group, separated from the other four factors. Similar results can be obtained using different multivariate tools.
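The idea of merging by "closeness" until the requested number of groups remains can be sketched as follows. The choices here are assumptions: the article does not say which linkage or distance measure was used, so single linkage with Euclidean distance is picked for brevity, and the six labelled points are hypothetical scores rather than the community data.

```python
import math

# Hypothetical two-variable scores for six communities (illustrative only).
points = {
    "A": (9.5, 9.0),
    "B": (6.0, 6.5), "C": (6.2, 6.0),
    "D": (2.0, 2.5), "E": (2.2, 2.0), "F": (1.8, 2.2),
}

def single_linkage(pts, n_groups):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in pts]

    def gap(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(pts[a], pts[b]) for a in c1 for b in c2)

    while len(clusters) > n_groups:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

three_groups = single_linkage(points, 3)
two_groups = single_linkage(points, 2)
print(three_groups)
print(two_groups)
```

Requesting a different number of groups simply stops the merging earlier or later, which is what rerunning the analysis with four groupings does; the sequence of merges is what a dendrogram draws.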
Canonical Correlation Analysis
Canonical analysis tests the hypothesis that effects can have multiple causes and causes can have multiple effects. This technique was developed by Hotelling in 1935 but was not widely used for over 50 years. The emergence of personal computers and statistical software has led to its fairly recent adoption. Canonical correlation analysis is a form of multiple regression used to find the correlation between two sets of linear combinations. Each set may contain several related variables. Relating one set of independent variables to one set of dependent variables forms linear combinations. The largest correlation values for the sets are used in the analysis. The pairings of linear combinations are called canonical variates, and the correlations are called canonical correlations (also called characteristic roots). There may be more than one pair of linear combinations applicable to an investigation. The maximum number of linear combinations is limited by the number of variables in the smaller set. Most studies involve only two sets. The canonical correlation coefficient, r_{c}, is similar to the Pearson product-moment correlation coefficient. The rule of thumb is to use values above 0.30, because a correlation below 0.30, when squared, represents less than 10% overlapping variance between the pair of canonical variates. The linear combinations can be determined from linear matrix algebra or statistical software. For instance, SPSS software can test for significance of canonical correlation and will provide several additional tests.
The table below illustrates the correlation of sets of independent variables to sets of dependent variables. An industrial survey can be conducted to see if there is a correlation between the characteristics of a quality engineer to the listed job skills of a quality engineer. There may be a set of variables that are strongly correlated and canonical correlation can be used.
Hotelling’s T^{2} test is a generalization of the t-test that handles two or more variables at a time. The Student t-test can compare 2 samples at a time, but if it is used to compare 5 samples, 2 at a time, the probability of a Type I error (finding a significant difference when the two samples are actually the same) increases. At a 5% significance level, the probability of at least one such error is approximately 1 – 0.95^{k}, where k = p(p – 1)/2 is the number of pairwise comparisons among p samples. For p = 5, k = 10 (which equals 2p in this case), giving a probability of about 0.40. Hotelling’s T^{2} is the preferred and recommended test method.
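The inflation of the Type I error with repeated pairwise t-tests can be checked directly. The sketch below assumes the tests are independent, which is only an approximation for comparisons that share samples, and counts the comparisons as all pairs among the samples (10 pairs for five samples). The function name `familywise_error` is illustrative.

```python
from math import comb

def familywise_error(n_samples, alpha=0.05):
    """Chance of at least one false positive across all pairwise comparisons,
    assuming independent tests, each run at significance level alpha."""
    n_tests = comb(n_samples, 2)
    return 1 - (1 - alpha) ** n_tests

for n in (2, 3, 5):
    print(n, "samples:", comb(n, 2), "tests, error probability",
          round(familywise_error(n), 4))
```

With five samples the chance of at least one spurious "significant" difference is roughly 40%, which is the motivation for using a single multivariate test instead.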
MANOVA (Multivariate Analysis of Variance)
An analysis of variance (ANOVA) relates one or more independent X variables to a single dependent Y variable. It tests whether the mean differences among groups on that single dependent Y variable are significant. For multiple dependent Y variables (that is, two or more Ys and one or more Xs), the multivariate analysis of variance (MANOVA) is used. MANOVA tests whether mean differences among groups on a combination of Ys are significant or not. The concepts of treatment levels and associated factors are still valid. The data should be normally distributed, have homogeneity of the covariance matrices, and have independence of observations. In ANOVA, a sum of squares is used for the treatments and for the error term. In MANOVA, the terms become matrices of the “sum of squares and cross-products” (SSCP). ANOVAs used multiple times across the dependent variables could result in inflated alpha errors. The MANOVA method is used to reduce the alpha risk by having only one test.
MANOVA Example
In an engineered plastics company, a multivariate experiment test was conducted having two independent variables (time and pressure of the extrusion process) at two levels, and three dependent responses (tensile strength, coefficient of friction, and bubble breaks). A MANOVA was conducted to test for relationships. The levels for the independent variables:
Time: high (+) equals 30 seconds, low (-) equals 10 seconds
Pressure: high (+) equals 80 psi, low (-) equals 20 psi
The shortened Minitab output for the MANOVA is presented in the table below. It only has three statistics tables for the responses and interactions. Minitab automatically inserts the four statistical tests (Wilks’, Lawley-Hotelling, Roy’s, and Pillai-Bartlett) and makes the analysis. The results indicate that both factors, time and pressure, are significant, with p values much below 5%. The interaction of time x pressure is not significant. For simplicity, the extensive SSCP tables are not displayed. For the individual familiar with linear algebra and matrices, the calculations can also be made manually.
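To make the SSCP idea concrete, the sketch below computes Wilks' lambda for a deliberately simplified case: a single factor at two levels with two responses, rather than the two-factor, three-response experiment above. The readings are hypothetical and the function names are illustrative. Wilks' lambda is taken as det(E) / det(H + E), where E is the within-group (error) SSCP matrix and H + E is the total SSCP matrix; values near 0 indicate that the group means differ, and values near 1 indicate that they do not.

```python
from statistics import mean

# Hypothetical (tensile strength, coefficient of friction) readings
# at the low and high level of one factor. Not the article's data.
low = [(50, 0.30), (52, 0.28), (51, 0.31), (49, 0.29)]
high = [(58, 0.22), (60, 0.20), (59, 0.23), (57, 0.21)]

def centroid(rows):
    return (mean(x for x, _ in rows), mean(y for _, y in rows))

def sscp(rows, mid):
    """2 x 2 matrix of sums of squares and cross-products about a center."""
    sxx = sum((x - mid[0]) ** 2 for x, _ in rows)
    sxy = sum((x - mid[0]) * (y - mid[1]) for x, y in rows)
    syy = sum((y - mid[1]) ** 2 for _, y in rows)
    return [[sxx, sxy], [sxy, syy]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def wilks_lambda(groups):
    all_rows = [row for rows in groups for row in rows]
    total = sscp(all_rows, centroid(all_rows))          # T = H + E
    error = [[0.0, 0.0], [0.0, 0.0]]                    # E, summed over groups
    for rows in groups:
        within = sscp(rows, centroid(rows))
        error = [[error[i][j] + within[i][j] for j in range(2)] for i in range(2)]
    return det2(error) / det2(total)

lam = wilks_lambda([low, high])
print(f"Wilks' lambda: {lam:.4f}")
```

Turning lambda into a p value requires an F approximation that is left to the statistical software; the sketch only shows where the determinant ratio comes from.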