y = f (x) Formula
The y = f(x) formula is used to determine which factors in your process (as indicated by a measure) you can change to improve the CTQs (critical-to-quality characteristics) and, ultimately, the key business measures. It illustrates the causal relationship among the key business measures (designated as Y), the process outputs directly affecting the Y’s (designated as CTQ or y), and the factors directly affecting the process outputs (designated as x). It enables members of your improvement team to communicate the team’s findings to others in a simple format, and it highlights the factors the team wants to change and what impact the change will have. It also provides a matrix that can be used in the Control step of the DMAIC method for ongoing monitoring of the process after the team’s improvement work is complete. Many people understand the concept of y = f(x) from their mathematical education. The x, y, Y matrix is based on this concept. If these letters confuse team members, simply use the terms key business measures, CTQ or process outputs, and causal factors instead; they represent the same concepts.
Gather the key business measures for your project, either from the team charter or by checking with your sponsor. Gather the CTQs that the improvement team selected as the most important for your project. List the key business measures and the CTQ operational definitions in a matrix. As your team progresses through the Measure and Analyze steps of the DMAIC method, add the causal-factor definitions (x’s) you discover.
Guidelines for Filling Out an x, y, Y Matrix
A sample x,y, Y Matrix
Correlation is used to determine the strength of linear relationships between two process variables. It allows the comparison of an input to an output, two inputs against each other, or two outputs against each other. Correlation measures the degree of association between two independent continuous variables. However, even if there is a high degree of correlation, this tool does not establish causation. For example, the number of skiing accidents in Colorado is highly correlated with sales of warm clothing, but buying warm clothes did not cause the accidents. Correlation can be analyzed by calculating the Pearson product moment correlation coefficient (r). This coefficient is calculated as follows:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, S_x S_y}$$
Where Sx and Sy are the sample standard deviations. The resulting value will be a number between -1 and +1. The higher the absolute value of r, the stronger the correlation. A value of zero means there is no linear correlation. A strong correlation is characterized by a tight distribution of plotted pairs about a best-fit line. It should be noted that correlation does not measure the slope of the best-fit line; it measures how close the data are to the best-fit line. A negative r implies that as one variable (x2) increases, the other variable (x1) decreases.
A positive r implies that as one variable (x3) increases, the other variable (x1) also increases.
A strong relationship other than linear can exist, yet r can be close to zero.
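As a minimal sketch of the calculation described above, the following Python function computes r from paired samples. It uses the equivalent form S_xy / sqrt(S_xx · S_yy), which gives the same value as dividing the sum of cross-products by (n − 1)·Sx·Sy. The function name and the three small data sets are hypothetical and chosen only to illustrate a positive r, a negative r, and a strong non-linear relationship whose r is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient for paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical illustrative data, not taken from the text
x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]       # perfectly linear, positive slope
inverse = [-0.5 * v + 4 for v in x]   # perfectly linear, negative slope
curved = [v ** 2 for v in x]          # strong but non-linear relationship

r_linear = pearson_r(x, linear)
r_inverse = pearson_r(x, inverse)
r_curved = pearson_r(x, curved)
print(r_linear, r_inverse, r_curved)
```

The curved data set follows y = x² exactly, yet its r is zero because r only measures how closely the points follow a straight line.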
Regression measures the strength of association between independent factor(s) (also called predictor variable(s) or regressors) and a dependent variable (also called a response variable). For simple or multiple linear regression, the dependent variable must be a continuous variable. Predictor variables can be continuous or discrete, but must be independent of one another. Discrete variables may be coded as discrete levels (dummy variables (0, 1) or effects coding (-1, +1)). Regression is used to investigate suspected correlations by generating an equation that quantifies the relationship. It explains the relationship through an equation for a line, curve, or surface. It explains the variation in y values and helps to predict the impact of controlling a process variable (x). It helps to predict future process performance for certain values of x. It also helps to identify the vital few x’s that drive y, and it helps you to manipulate process conditions to generate desirable results (if x is controllable) and/or avoid undesirable results.
For linear regressions (i.e., when the relationship is defined by a line), the regression equation is represented as y = a0 + a1x, where a0 = intercept (i.e., the value of y where the line crosses x = 0) and a1 = slope (i.e., rise over run, or change in y per unit increase in x).
- Simple linear regression relates a single x to a y. It has a single regressor (x) variable and its model is linear with respect to coefficients (a).
y = a0 + a1x + error
y = a0 + a1x + a2x^2 + a3x^3 + error
“Linear” refers to the coefficients a0, a1, a2, etc. In the second example, the relationship between x and y is a cubic polynomial in nature, but the model is still linear with respect to the coefficients.
- Multiple linear regression relates multiple x’s to a y. It has multiple regressor (x) variables such as x1, x2, and x3. Its model is linear with respect to coefficients (b).
y = b0 + b1x1 + b2x2 + b3x3 + error
- Binary logistic regression relates x’s to a y that can only have a dichotomous value (one of two mutually exclusive outcomes such as pass/fail, on/off, etc.)
- Least squares method: Use the least squares method, where you determine the regression equation by using a procedure that minimizes the total squared distance from all points to the line. This method finds the line where the sum of the squared vertical distances from the data points to the line is as small as possible (or the “least”). This means that the method minimizes the sum of the squares of all the residuals.
Steps in Regression Analysis
- Plot the data on a Scatter Diagram: Be sure to plot your data before doing regression. The charts below show four sets of data that have the same regression equation: y = 3 + 0.5x.
Obviously, there are four completely different relationships.
- Measure the vertical distance from the points to the line
- Square the figures
- Sum the total squared distance
- Find the line that minimizes the sum
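The four steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of any particular statistics package: the function names and the five data points are hypothetical. The first function carries out the measure, square, and sum steps for any candidate line, and the second uses the standard closed-form solution for the line that minimizes that sum.

```python
def sum_squared_distance(xs, ys, intercept, slope):
    """Vertical distances from the points to the line, squared, then summed."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Closed-form intercept and slope that minimize the summed squared distance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a0, a1 = least_squares_line(xs, ys)
best = sum_squared_distance(xs, ys, a0, a1)

# Nudging the fitted line in any direction should never reduce the sum
nudged = [
    sum_squared_distance(xs, ys, a0 + da, a1 + db)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.05, 0.0, 0.05)
]
print(a0, a1, best, min(nudged) >= best)
```

Comparing the fitted line against nudged alternatives is only a spot check of the "least" property; it is not a proof that the line is optimal.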
Generally a computer program is used to generate the “best fit” line that represents the relationship between x and y. The following sets of terms are often used interchangeably:
- Regression equation and regression line.
- Prediction equation and prediction line.
- Fitted line, or fits, and model.
When two variables show a relationship on a scatter plot, they are said to be correlated, but this does not necessarily mean they have a cause/ effect relationship. Correlation means two things vary together. Causation means changes in one variable cause changes in the other.
The residual is the leftover variation in y after you use x to predict y. The residual represents common-cause (i.e., random and unexplained) variation. You determine a residual by subtracting the predicted y from the observed y.
Residuals are assumed to have the following properties:
- Not related to the x’s.
- Stable, independent, and not changing over time.
- Constant and not increasing as the predicted y’s increase.
- Normal (i.e., bell-shaped) with a mean of zero.
Check each of these assumptions. If the assumptions do not hold, the regression equation might be incorrect or misleading.
Simple Linear Regression Model
Consider the problem of predicting the test results (y) for students based upon an input variable (x), the amount of preparation time in hours, using the data presented in the table below.
[Table: Study Time (hours) and Test Results (%)]
[Figure: Study Time Versus Test Results]
An initial approach to the analysis of the data is to plot the points on a graph known as a scatter diagram. Observe that y appears to increase as x increases. One method of obtaining a prediction equation relating y to x is to place a ruler on the graph and move it about until it seems to pass through the majority of the points, thus providing what is regarded as the “best fit” line.
The mathematical equation of a straight line is:
Y = β0 + β1x
Where β0 is the y intercept when x = 0 and β1 is the slope of the line. Here the x axis does not go to zero so the y intercept appears too high. The equation for a straight line in this example is too simplistic. There will actually be a random error which is the difference between an observed value of y and the mean value of y for a given value of x. One assumes that for any given value of x, the observed value of y varies in a random manner and possesses a normal probability distribution.
The probabilistic model for any particular observed value of y is:
y = (mean value of y for a given value of x) + (random error)
Y = β0 + β1x + ε
The Method of Least Squares
The statistical procedure of finding the “best-fitting” straight line is, in many respects, a formalization of the procedure used when one fits a line by eye. The objective is to minimize the deviations of the points from the prospective line. If one denotes the predicted value of y obtained from the fitted line as $\hat{y}$, the prediction equation is:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
Having decided to minimize the deviation of the points in choosing the best fitting line, one must now define what is meant by “best.”
The best fit criterion of goodness known as the principle of least squares is employed:
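Choose, as the best fitting line, the line that minimizes the sum of squares of the deviations of the observed values of y from those predicted. Expressed mathematically, minimize the sum of squared errors given by:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The least squares estimators of β0 and β1 are calculated as follows:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$.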
One may predict y for a given value of x by substitution into the prediction equation. For example, if 60 hours of study time is allocated, the predicted test score would be:
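The data table is not reproduced in this text, so the sketch below only approximates the substitution. It assumes the values 772 and 1,110 quoted in the correlation example later in this section are S_xy and S_xx, and it takes the intercept as the rounded 31% mentioned in the discussion of extrapolation. The names `predict_score` and `INTERCEPT_APPROX` are hypothetical.

```python
# Summary values quoted elsewhere in this section for the study-time data
S_XY = 772.0    # assumption: sum of cross-products about the means
S_XX = 1110.0   # assumption: sum of squares of x about its mean
INTERCEPT_APPROX = 31.0  # assumption: rounded intercept (31%) mentioned in the text

def predict_score(hours, intercept=INTERCEPT_APPROX, slope=S_XY / S_XX):
    """Predicted test score (%) from the fitted line y_hat = b0 + b1 * x."""
    return intercept + slope * hours

slope = S_XY / S_XX
score_60 = predict_score(60)
print(f"slope = {slope:.4f} points per hour")
print(f"predicted score at 60 hours = {score_60:.1f}%")
```

Because the intercept is rounded, the prediction should be read as roughly 73% rather than as an exact figure.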
When doing regression analysis, be careful of rounding errors. Normally, the calculations should carry a minimum of six significant figures in computing sums of squares of deviations. Note that the prior example consisted of convenient whole numbers, which does not occur often. Always plot the data points and graph the least squares line. If the line does not provide a reasonable fit to the data points, there may be a calculation error. Projecting a regression line outside of the test area can be risky. The above equation suggests that, without study, a student would make 31% on the test. The odds favor 25% if answer a is selected for all questions. The equation also suggests that with 100 hours of study the student should attain 100% on the examination, which is highly unlikely.
Calculating $s_\varepsilon^2$, an Estimator of $\sigma_\varepsilon^2$
Recall, the model for y assumes that y is related to x by the equation:
Y = β0 + β1x + ε
If the least squares line is used:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
A random error, ε, enters into the calculations of β0 and β1. The random errors affect the error of prediction. Consequently, the variability of the random errors (measured by $\sigma_\varepsilon^2$) plays an important role when predicting by the least squares line.
The first step toward acquiring a bound on a prediction error requires that one estimate $\sigma_\varepsilon^2$. It is reasonable to use SSE (sum of squares for error) based on (n – 2) degrees of freedom; two degrees of freedom are used up in estimating β0 and β1.
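An Estimator for $\sigma_\varepsilon^2$
$$s_\varepsilon^2 = \frac{SSE}{n - 2}$$
SSE = Sum of squared errors
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SSE may also be written:
$$SSE = S_{yy} - \hat{\beta}_1 S_{xy}$$
where $S_{yy} = \sum (y_i - \bar{y})^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$.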
Example: Calculate an estimated $\sigma_\varepsilon^2$ for the data in the table given above.
The existence of a significant relationship between y and x can be tested by checking whether β1 is equal to 0. If β1 ≠ 0, there is a linear relationship. The null hypothesis and alternative hypothesis are:
$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$
The test statistic follows a t distribution with n – 2 degrees of freedom:
$$t = \frac{\hat{\beta}_1}{s_\varepsilon / \sqrt{S_{xx}}}$$
where $s_\varepsilon$ is the square root of the estimated error variance and $S_{xx} = \sum (x_i - \bar{x})^2$.
Example: From the data in the table above, determine if the slope results are significant at a 95% confidence level.
For a 95% confidence level, determine the critical values of t with α = 0.025 in each tail, using n – 2 = 8 degrees of freedom: -t0.025,8 = -2.306 and t0.025,8 = 2.306. Reject the null hypothesis if t > 2.306 or t < -2.306, depending on whether the slope is positive or negative. In this case, the null hypothesis is rejected and we conclude that β1 ≠ 0 and there is a linear relationship between y and x.
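Both the variance estimate and the slope test can be reproduced from summary statistics. The sketch below assumes that the three values quoted in the correlation example later in this section (772, 1,110, and 696.9) are S_xy, S_xx, and S_yy, and that n = 10, which follows from the stated n – 2 = 8. The critical value 2.306 is taken from the text rather than computed. All variable names are hypothetical.

```python
import math

# Summary values quoted in this section, read as sums of squares (assumption)
S_XY, S_XX, S_YY = 772.0, 1110.0, 696.9
N = 10               # implied by n - 2 = 8 degrees of freedom in the text
T_CRITICAL = 2.306   # t(0.025, 8), as given in the text

slope = S_XY / S_XX                        # least squares estimate of beta_1
sse = S_YY - slope * S_XY                  # SSE = S_yy - b1 * S_xy
variance_estimate = sse / (N - 2)          # estimator of the error variance
std_error_slope = math.sqrt(variance_estimate / S_XX)
t_statistic = slope / std_error_slope      # test statistic for H0: beta_1 = 0

reject_null = abs(t_statistic) > T_CRITICAL
print(f"SSE = {sse:.2f}, estimated variance = {variance_estimate:.2f}")
print(f"t = {t_statistic:.2f}, reject H0: {reject_null}")
```

Under these assumptions the estimated error variance is about 20 and the t statistic is a little over 5, well beyond the critical value, which agrees with the conclusion that the null hypothesis is rejected.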
Confidence Interval Estimate for the Slope β1
The confidence interval estimate for the slope β1 is given by:
$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\,\frac{s_\varepsilon}{\sqrt{S_{xx}}}$$
For example, substitute the previous data into the above formula to obtain the confidence interval around the slope of the line.
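The substitution can be worked through as follows. This is a sketch that assumes the values 772, 1,110, and 696.9 quoted in the correlation example of this section are S_xy, S_xx, and S_yy, with n = 10 and the critical value 2.306 given earlier.

```latex
\begin{aligned}
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} = \frac{772}{1110} \approx 0.6955 \\[4pt]
s_\varepsilon^2 &= \frac{S_{yy} - \hat{\beta}_1 S_{xy}}{n - 2}
  = \frac{696.9 - 0.6955(772)}{8} \approx 20.0,
  \qquad s_\varepsilon \approx 4.472 \\[4pt]
\hat{\beta}_1 \pm t_{0.025,\,8}\,\frac{s_\varepsilon}{\sqrt{S_{xx}}}
  &= 0.6955 \pm 2.306 \cdot \frac{4.472}{\sqrt{1110}}
  \approx 0.6955 \pm 0.3095 \\[4pt]
0.386 &\le \beta_1 \le 1.005
\end{aligned}
```

Multiplying both limits by 10 gives the per-10-hours interval of 3.86 to 10.05 percentage points.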
Intervals constructed by this procedure will enclose the true value of β1 95% of the time. Hence, for every 10 hours of increased study, the expected increase in test scores would fall in the interval of 3.86 to 10.05 percentage points.
The population linear correlation coefficient, ρ, measures the strength of the linear relationship between the paired x and y values in a population. ρ is a population parameter. For the population, the Pearson product moment coefficient of correlation, $\rho_{x,y}$, is given by:
$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$
Where cov means covariance. Note that -1 ≤ ρ ≤ +1.
The sample linear correlation coefficient, r, measures the strength of the linear relationship between the paired x and y values in a sample. r is a sample statistic. For a sample, the Pearson product moment coefficient of correlation, $r_{x,y}$, is given by:
$$r_{x,y} = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}$$
For example, using the study time and test score data reviewed earlier, determine the correlation coefficient, given $S_{xy}$ = 772, $S_{xx}$ = 1,110, and $S_{yy}$ = 696.9:
$$r = \frac{772}{\sqrt{1110 \times 696.9}} \approx 0.878$$
The numerator used in calculating r is identical to the numerator of the formula for the slope β1. Thus, the coefficient of correlation r will assume exactly the same sign as β1 and will equal zero when β1 = 0.
- A positive value for r implies that the line slopes upward to the right.
- A negative value for r implies that the line slopes downward to the right.
- Note that r = 0 implies no linear correlation, not simply “no correlation.” A pronounced curvilinear pattern may exist.
When r = 1 or r = -1, all points fall on a straight line; when r = 0, they are scattered and give no evidence of a linear relationship. Any other value of r suggests the degree to which the points tend to be linearly related. If x is of any value in predicting y, then SSE can never be larger than:
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Coefficient of Determination (R2)
The coefficient of determination is R2. The square of the linear correlation coefficient is r2. It can be shown that: R2 = r2 .
The coefficient of determination is the ratio of the explained variation to the total variation when a linear regression is performed. r2 lies in the interval 0 ≤ r2 ≤ 1. r2 will equal 1 only when all the points fall exactly on the fitted line, that is, when SSE equals zero.
$$r^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$
Where SST = total sum of squares (from the experimental average) and SSE = total sum of squared errors (from the best fit). Note that when SSE is zero, r2 equals one, and when SSE equals SST, r2 equals zero.
For example, using the data from the example above, determine the coefficient of determination:
$$r^2 = \frac{772^2}{1110 \times 696.9} \approx 0.77$$
One can say that 77% of the variation in test scores can be explained by variation in study hours.
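The relationship between r and the coefficient of determination can be checked numerically. The sketch below assumes the three values quoted in the correlation example (772, 1,110, and 696.9) are S_xy, S_xx, and S_yy, and it uses the identity SSE = S_yy − S_xy²/S_xx for a least squares line. Variable names are hypothetical.

```python
import math

# Summary values quoted in this section, read as sums of squares (assumption)
S_XY, S_XX, S_YY = 772.0, 1110.0, 696.9

# Sample correlation coefficient
r = S_XY / math.sqrt(S_XX * S_YY)

# Coefficient of determination from the sums of squares
sst = S_YY                          # total variation about the mean of y
sse = S_YY - (S_XY ** 2) / S_XX     # variation left over after the fit
r_squared = 1 - sse / sst

print(f"r = {r:.4f}")
print(f"1 - SSE/SST = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

Both routes give about 0.77, matching the statement that 77% of the variation in test scores is explained by study hours.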
Correlation Versus Causation
Consider an example in which there is strong evidence of a correlation between car weight and gas mileage. The student should be aware that a number of other factors (carburetor type, car design, air conditioning, passenger weights, speed, etc.) could also be important. The most important cause may be a different or a collinear variable. For example, car and passenger weight may be collinear. There can also be such a thing as a nonsensical correlation, e.g., it rains after my car is washed.
Simple Linear Regression in a Nutshell
- Determine which relationship will be studied.
- Collect data on the x and y variables.
- Set up a fitted line plot by charting the independent variable on the x axis and the dependent variable on the y axis.
- Create the fitted line. If creating the fitted line plot by hand, draw a straight line through the values that keeps the least amount of total space between the line and the individual plotted points (a “best fit”). If using a computer program, compute and plot this line via the “least squares method.”
- Compute the correlation coefficient r.
- Determine the slope and y intercept of the line by using the equation y = mx + b. The y intercept (b) is the point on the y axis through which the “best fitted line” passes (at this point, x = 0). The slope of the line (m) is computed as the change in y divided by the change in x (m = Δy/Δx). The slope, m, is also known as the coefficient of the predictor variable, x.
- Calculate the residuals. The difference between the predicted response variable for any given x and the experimental value or actual response (y) is called the residual. The residual is used to determine if the model is a good one to use. The estimated standard deviation of the residuals is a measure of the error term about the regression line.
- To determine significance, perform a t-test (with the help of a computer) and calculate a p-value for each factor. A p-value less than α (usually 0.05) will indicate a statistically significant relationship.
- Analyze the entire model for significance using ANOVA, which displays the results of an F-test with an associated p-value.
- Calculate R2 and R2 adj. R2, the coefficient of determination, is the square of the correlation coefficient and measures the proportion of variation that is explained by the model. Ideally, R2 should be equal to one, which would indicate zero error.
R2 = SSregression / SStotal
= (SStotal – SSerror ) / SStotal
= 1-[SSerror / SStotal ]
Where SS = the sum of the squares.
R2 adj is a modified measure of R2 that takes into account the number of terms in the model and the number of data points.
R2 adj = 1- [SSerror / (n-p)] / [SStotal / (n-1)]
Where n = number of data points and p = number of terms in the model. The number of terms in the model also includes the constant.
Note: Unlike R2, R2 adj can become smaller when added terms provide little new information and as the number of model terms gets closer to the total sample size. Ideally, R2 adj should be maximized and as close to R2 as possible. Conclusions should be validated, especially when historical data has been used.
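The two formulas above translate directly into code. In this sketch the function names are hypothetical. The first call uses SS_total = 696.9 and SS_error ≈ 159.98, which are derived from the study-time summary values under the assumption that n = 10 and p = 2 (constant plus slope). The second call is a purely hypothetical scenario in which a third term is added that barely lowers SS_error.

```python
def r_squared(ss_error, ss_total):
    """R2 = 1 - SS_error / SS_total."""
    return 1 - ss_error / ss_total

def adjusted_r_squared(ss_error, ss_total, n, p):
    """R2 adj; p counts every term in the model, including the constant."""
    if n <= p:
        raise ValueError("need more data points than model terms")
    return 1 - (ss_error / (n - p)) / (ss_total / (n - 1))

# Study-time example (derived values; n = 10 and p = 2 are assumptions)
r2 = r_squared(159.98, 696.9)
r2_adj = adjusted_r_squared(159.98, 696.9, n=10, p=2)

# Hypothetical: one extra term that reduces SS_error only slightly
r2_more = r_squared(159.0, 696.9)
r2_adj_more = adjusted_r_squared(159.0, 696.9, n=10, p=3)

print(f"2 terms: R2 = {r2:.4f}, R2 adj = {r2_adj:.4f}")
print(f"3 terms: R2 = {r2_more:.4f}, R2 adj = {r2_adj_more:.4f}")
```

In the hypothetical three-term case R2 rises slightly while R2 adj falls, which is the behavior the note above describes for terms that add little new information.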
Multiple Linear Regression
Multiple linear regression is an extension of the methodology for linear regression to more than one independent variable. By including more than one independent variable, a higher proportion of the variation in y may be explained.
First-Order Linear Model
Y = β0 + β1x1 +β2x2 +…… + βkxk + ε
A Second-Order Linear Model (Two Predictor Variables)
Y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1^2 + β5x2^2 + ε
Just like r2 (the linear coefficient of determination), R2 (the multiple coefficient of determination) takes values in the interval: 0 ≤ R2 ≤ 1
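A first-order model with several predictors can be fitted by solving the normal equations (XᵀX)b = Xᵀy. The sketch below does this with plain Python and Gaussian elimination so that it has no dependencies; in practice a statistics package would be used. The function names and the data are hypothetical, and the data are generated exactly from y = 1 + 2x1 + 3x2 so that the recovered coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

def fit_multiple_regression(predictors, ys):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term)
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

b0, b1, b2 = fit_multiple_regression(predictors, ys)
print(b0, b1, b2)
```

The collinearity check reflects the requirement that predictor variables be independent of one another: if one predictor is an exact linear combination of the others, the normal equations have no unique solution.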
Attributes Data Analysis
The analysis of attribute data is organized into dichotomous values, categories, or groups. Applications involve decisions such as yes/no, pass/fail, good/bad, poor/fair/good/super/ excellent, etc. Some of the techniques used in nonlinear regression models include: logistic regression analysis, logit regression analysis, and probit regression analysis. A description of the three models follows:
- Logistic regression relates categorical, independent variables to a single dependent variable. The three models described within Minitab are binary, ordinal, and nominal.
- Logit analysis is a subset of the log-linear model. It deals with only one dependent variable, using odds and odds ratio determinations.
- Probit analysis is similar to accelerated life testing. A unit has a stress imposed on it with the response being pass/fail, good/bad, etc. The response is binary (good/bad) versus an actual failure time.
Log-linear models are nonlinear regression models similar to linear regression equations. Since they are nonlinear, it is necessary to take the logs of both sides of the equation in order to produce a linear equation. This produces a log-linear model. Logit models are subsets of this model.
Logistic regression is used to establish a y = f (x) relationship when the dependent variable (y) is binomial or dichotomous. Similar to regression, it explores the relationships between one or more predictor variables and a binary response. Logistic Regression helps us to predict the probability of future events belonging to one group or another (i.e., pass/fail, profitable/nonprofitable, or purchase/not purchase). Logistic regression relates one or more independent variables to a single dependent variable. The independent variables are described as predictor variables and the response is a dependent variable. Logistic regression is similar to regular linear regression, since both have regression coefficients, predicted values, and residuals. Linear regression assumes that the response variable is continuous, but for logistic regression, the response variable is binary. The regression coefficients for linear regression are determined by the ordinary least squares approach, while logistic regression coefficients are based on a maximum likelihood estimation.
Logistic regression can provide analysis of the two values of interest: yes/no, pass/fail, good/bad, enlist/not enlist, vote/no vote, etc. A logistic regression can also be described as a binary regression model. It is nonlinear and has an S-shaped form. The values are never below 0 and never above 1. The general logistic regression equation can be shown as:
y = b0 + b1x1 + e, where y = 0, 1
The probability of results being in a certain category is given by:
$$p = \frac{e^{b_0 + b_1 x_1}}{1 + e^{b_0 + b_1 x_1}}$$
The predictor variables (x’s) can be either continuous or discrete, just as for any problem using regression. However, the response variable has only two possible values (e.g., pass/fail, etc.). Because regression analysis requires a continuous response variable that is not bounded, this must be corrected. This is accomplished by first converting the response from events (e.g., pass/fail) to the probability of one of the events, or p. Thus if p = Probability (pass), then p can take on any value from 0 to 1. This conversion results in a continuous response, but one that is still bounded. An additional transformation is required to make the response both continuous and unbounded. This is called the link function. The most common link function is the “logit,” which is explained below.
Y = β0 + β1x
We need a continuous, unbounded Y.
Logistic regression, also known as binary logistic regression (BLR), fits sample data to an S-shaped logistic curve. The curve represents the probability of the event. At low levels of the independent variable (x), the probability approaches zero. As the predictor variable increases, the probability increases to a point where the slope decreases. At high levels of the independent variable, the probability approaches 1. The following two examples fit probability curves to actual data. The curve on the top represents the “best fit.” The curve through the data on the bottom contains a zone of uncertainty where events and non-events (1’s and 0’s) overlap.
If the probability of an event, p, is greater than 0.5, binary logistic regression would predict a “yes” for the event to occur. The probability of an event not occurring is described as (1-p). The odds, or p/(1-p),compares the probability of an event occurring to the probability of it not occurring. The logit, or “link” function, represents the relationship between x and y.
Steps for Logistic Regression
- Define the problem and the question(s) to be answered.
- Collect the appropriate data in the right quantity.
- Hypothesize a model.
- Analyze the data. Many statistical software packages are available to help analyze data.
- Check the model for goodness of fit.
- Check the residuals for violations of assumptions.
- Modify the model, if required, and repeat.
An example will be used to compare the number of hours studied for an exam versus pass/fail responses. The data is provided for 50 students. The number of hours a student spends studying is recorded. In addition, the end result, the dependent (pass/fail) variable, is noted. In logistic regression, because of the use of attribute data, there should be 50 data points per variable. An analysis will be made for the regular linear regression model. This result will then be compared to the logistic regression model. Logistic regression can be used to predict the probability that an observation belongs to one of two groups.
For the logistic regression example, Minitab is used to determine the regression coefficients. Using Excel to calculate the probabilities, the logistic regression curve is displayed in the figure.
An S-shaped curve can be used to smooth out the data points. The curve moves from the zero probability point up to the 1.0 line. The probabilities in the logistic curve were calculated from the equation:
$$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$
Using Minitab, the regression coefficients and the equation can be determined. After determining the regression coefficients, the probability of a student passing the exam, after studying 80 hours, can be calculated.
It appears that there is a 54.5% probability of passing after 80 hours of study. At 100 hours or more, the probability of passing increases to more than 90%.
Minitab 13 has three logistic regression procedures: binary, ordinal, and nominal.
Grimm provides the following logistic regression assumptions:
- There are only two values (pass/fail) with only one outcome per event
- The outcomes are statistically independent
- All relevant predictors are in the model
- The categories are mutually exclusive and collectively exhaustive (one category at a time)
- Sample sizes are larger than for linear regression
The individual regression coefficients can be tested for significance through comparison of the coefficient to its standard error. The z value will be compared to the value obtained from the normal distribution.
The logistic regression model can be tested via several goodness-of-fit tests. Minitab will automatically test three different methods: Pearson, Deviance, and Hosmer-Lemeshow. The simple logistic regression model can be extended to include several other predictors (called multiple logistic regression). If the model contains only categorical variables, it can be classified as a log-linear model.
Logit Analysis
Logit analysis uses odds to determine how much more likely an observation will be a member of one group versus another group (pass/fail, etc.). A probability of p = 0.80 of being in group A (passing) can be expressed in odds terms as 4:1. There are 4 chances to pass versus 1 chance to fail, or odds of 4:1. The ratio 4/1 or p/(1-p) is called the odds, and the log of the odds, L=ln(p/(1-p)) is called the Logit.
The logit is unbounded: it ranges from -∞ to +∞ as p ranges from 0 to 1. The probability for a given L value is provided by the equation:
p = e^L/(1 + e^L)
Example: From the previous data, there were 50 students who took the exam, but only 27 passed. What are the odds of passing?
Odds = p/(1 - p) = 0.54/0.46 = 1.17 or 1.17:1
Example: From the previous data, a student studying 80 hours has a 54.5% chance of passing. What are the odds and the accompanying logit?
Odds = p/(1 - p) = 0.545/(1 - 0.545) = 0.545/0.455 = 1.198 or 1.198:1
Logit = ln(p/(1 - p)) = ln(1.198) ≈ 0.181
To find the probability, use the logit equation:
p = e^L/(1 + e^L) = e^0.181/(1 + e^0.181) = 1.198/2.198 = 0.545
If the student studies 80 hours, the probability of passing is 54.5%. This is the same result as before, but represents another way to calculate it.
The odds ratio is the change in the odds of moving up or down a level in a group for a one-unit increase or decrease in the predictor. The exponential function and the slope coefficient, b1, are used to determine the odds ratio. If b1 = 0.10821, then the odds ratio for moving to another level is: e^b1 = e^0.10821 ≈ 1.11
Positive effects are greater than 1, while negative effects are between 0 and 1.
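The odds, logit, and odds ratio calculations in this section can be sketched in Python as follows. The function names are hypothetical; the inputs (27 of 50 passing, p = 0.545 at 80 hours, and b1 = 0.10821) are the values used in the examples above.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def probability_from_logit(l):
    """Inverse of the logit: p = e^L / (1 + e^L)."""
    return math.exp(l) / (1 + math.exp(l))

def odds_ratio(b1):
    """Multiplicative change in the odds for a one-unit increase in x."""
    return math.exp(b1)

odds_all = odds(27 / 50)                      # 27 of 50 students passed
odds_80 = odds(0.545)                         # student who studied 80 hours
logit_80 = logit(0.545)
p_back = probability_from_logit(logit_80)     # round trip back to 0.545
ratio_up = odds_ratio(0.10821)                # positive coefficient
ratio_down = odds_ratio(-0.10821)             # negative coefficient

print(f"odds (all students) = {odds_all:.3f}")
print(f"odds at 80 h = {odds_80:.3f}, logit = {logit_80:.4f}, p = {p_back:.3f}")
print(f"odds ratios: {ratio_up:.3f} (b1 > 0), {ratio_down:.3f} (b1 < 0)")
```

The last line illustrates the closing statement above: a positive coefficient gives an odds ratio greater than 1, and a negative coefficient gives one between 0 and 1.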
Logit Regression Model
In cases where the values for each category are continually increasing or continually decreasing, a log transform should be performed to obtain a near straight line. Expanding the logit formula to obtain a straight linear model results in the following formula:
L = logit = ln(p/(1-p)) = ln(e^(b0+b1x1)) = ln(e^b0 · e^(b1x1)) = b0 + b1x1
The equation expanded for multiple predictor variables, x1, x2, …, xn, is: L = b0 + b1x1 + b2x2 + … + bnxn
Logit Regression Example
A medical researcher, with an interest in physical fitness, conducted a long-term walking plan for weight loss. She was able to enroll 1,120 patients in a 2-year walking program. The results were positive and weight loss appeared to accelerate as the number of steps walked increased. The data is presented in the table below. A linear regression of the data produced a good R2 value of 93.8%. However, the graph indicated nonlinear results.
A logit transformation of the data values was performed. The logit value was obtained by dividing the “number lost > 30 lb” by “number lost < 30 lb”, and then taking the natural log. A regression was performed on the steps walked and logit to obtain a new equation.
The resulting R2 was 98.7%. The equation is: L = -5.307 + 0.00067053 x1
Probit analysis is similar to accelerated life testing and survivability analysis. An item has a stress imposed upon it to see if it fails or survives. The probit model has an expected variance of 1 and a mean of zero. The logit model has an expected variance of π2 / 3 = 3.29 and an expected mean of zero. This probit model is close to the logit model, since it requires extremely large sample sizes to realize a difference from the logit model.
The probit model is: Φ^-1(p) = α + βx = b0 + b1x
Where b0 = -µ/σ and b1 = -1/σ, or σ = 1/|b1|
In comparing the logit to the probit models, the b coefficients of the logit and the probit differ by a factor of about 1.814. That is: bL = -1.814 bp
For example: A circular plate is welded to a larger plate to form a supporting structure. There is a need to validate the structure’s capability to resist a torque force. This is a destructive test of the weldment, with a binary response consisting of success or failure. A torque wrench will be used to twist the structure. The levels of applied force are 50, 100, 150, and 200 lbf-in. A total of 100 samples will be tested at each level of force.
The probit analysis, using Minitab, indicates the normal model was nonlinear and the logistic model would be a better choice. The coefficients are:
b0 = 4.0058, b1 = -0.031368
The probit model has 7 life distributions to choose from: normal, lognormal (base e), lognormal (base 10), logistic, loglogistic, Weibull, and extreme value. Using Minitab, the best fit for the above data was the logistic distribution. The percentiles, survival probabilities, and plots can also be obtained using Minitab.
For the weldment example, a table of percentiles (Minitab) provides the percentage of surviving parts at various levels of torque. The plot in the figure shows the 50% level to be about 130 lbf-in. The 5% survival level would be about 220 lbf-in.
Refer to Table below for a listing of low survival percentages.
The 5% success level (or 95% failure level) indicates the torque force to be 221.5752 lbf-in. The 95% confidence interval is also included.
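The percentile values quoted for the weldment example can be approximated directly from the coefficients. The sketch below assumes that b0 = 4.0058 and b1 = -0.031368 define a logistic model of the survival probability, ln(p/(1 - p)) = b0 + b1 × torque; the text does not state this parameterization explicitly, and the function names are hypothetical.

```python
import math

B0 = 4.0058       # coefficient from the text
B1 = -0.031368    # coefficient from the text

def survival_probability(torque):
    """Assumed model: logit of survival is linear in torque (lbf-in)."""
    l = B0 + B1 * torque
    return math.exp(l) / (1 + math.exp(l))

def torque_at_survival(p):
    """Invert the logit equation to find the torque giving survival p."""
    return (math.log(p / (1 - p)) - B0) / B1

torque_50 = torque_at_survival(0.50)
torque_05 = torque_at_survival(0.05)

print(f"50% survival at about {torque_50:.1f} lbf-in")
print(f" 5% survival at about {torque_05:.1f} lbf-in")
print(f"check: p({torque_05:.1f}) = {survival_probability(torque_05):.3f}")
```

Under this assumption the 50% level comes out near 128 lbf-in and the 5% level near 221.6 lbf-in, in line with the plot reading of about 130 and the tabulated 221.5752; the small difference is consistent with rounding of the published coefficients.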