Simple Regression

A regression equation describes the influence of one or more independent variables on a dependent variable. Simple regression has a single dependent variable and a single independent variable, while multiple regression has one dependent variable and several independent variables.

Let’s think about rain (in inches) and corn production (in tons). In this case of simple regression, the rain is the independent variable (denoted by x) and the corn production is the dependent variable (denoted by y). Our objective is to determine the influence of rain on the corn production.

Let's imagine after computation, the equation of the regression line is:

 

y = 2 + 0.5 x

 

This means that if there is no rain, the predicted corn production is 2 tons, and each additional inch of rain increases the predicted corn production by 0.5 tons.
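As a small illustration of reading this equation, the sketch below evaluates y = 2 + 0.5 x for a few rainfall amounts. The function name predict_corn and the chosen rainfall values are ours, picked only for illustration.

```python
def predict_corn(rain_inches):
    """Predicted corn production (tons) from the example line y = 2 + 0.5 x."""
    intercept = 2.0   # tons of corn predicted when there is no rain
    slope = 0.5       # extra tons of corn per additional inch of rain
    return intercept + slope * rain_inches

# Illustrative rainfall amounts (not data from a real study)
for rain in [0, 1, 4, 10]:
    print(rain, "inches ->", predict_corn(rain), "tons")
```

Moving from 0 to 1 inch of rain changes the prediction by exactly the slope, which is the interpretation given above.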

In real life, several independent variables, such as rain, machinery, soil condition and quality of fertilizer, may influence corn production. Ascertaining the influence of all of these independent variables on corn production is a multiple regression problem.

Correlation

Correlation determines the degree of association between two variables. In the statistical sense, correlation analysis measures the degree of linear relationship between two variables, and we compute the correlation coefficient (r) to express it. Numerically, the correlation coefficient ranges from -1 to 1: values near -1 or 1 indicate a strong linear relationship, and values near 0 indicate a weak one.

Correlation is not causation. High correlation between two variables does not imply that one of these variables “causes” the behavior of the other variable. Correlation is not a cause and effect relationship.
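One way to compute r is sketched below, using the usual Pearson formula (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations). The function name pearson_r and the rain and corn figures are hypothetical and serve only as an illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: rain in inches, corn production in tons
rain = [1.0, 2.0, 3.0, 4.0, 5.0]
corn = [2.4, 3.1, 3.4, 4.2, 4.4]
r = pearson_r(rain, corn)
print("r =", round(r, 3))
```

Points lying exactly on a rising line give r = 1, and points lying exactly on a falling line give r = -1.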

Simple Regression vs. Correlation
Correlation analysis attempts to determine whether there is a linear relationship between two variables and how strong that relationship is, while simple regression determines the influence of the independent variable on the dependent variable.
Coefficient of determination R2
The coefficient of determination is the proportion of the variation in the dependent variable explained by the regression model, and is a measure of the goodness of fit of the model. It can range from 0 to 1. It tells us how much of the total variation in the dependent variable the regression line accounts for, and we want the line to account for as much of that variation as possible. The higher the value of R2, the more reliable the regression line is for prediction and business decision making purposes.
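Written as a formula, this proportion is commonly expressed as below. The symbols SSE (variation left unexplained by the line), SST (total variation), \hat{y}_i (the value on the regression line for the i-th observation) and \bar{y} (the mean of the observed y values) are our notation, not part of the text above.

```latex
R^2 = 1 - \frac{SSE}{SST}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

When the line passes through every point, SSE is 0 and R2 is 1. When the line explains none of the variation, SSE equals SST and R2 is 0.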

Describing data with a simple regression equation

Graphically, we can draw a straight line on the graph so it passes through the cluster of points, as in Figure 1. Simple regression is a way of choosing the best straight line for this job.

Figure 1

This raises two problems: what is the best straight line, and how can we describe it when we have found it?

Let's deal first with describing a straight line. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write,

y = mx + c

Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m that it slopes downwards. High m values mean a steep slope, low values a shallow one. The value of c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value, negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:

y = a + bx

This is just the same equation with different names for the constants: a is the intercept, b is the gradient.

The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary of a set of data: we choose the a and b values that give us the line such that the sum of squared deviations from the line, instead of from the average, is minimized. This is illustrated in Figure 2. The best line is called the regression line, and the equation describing it is called the regression equation.
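For a single independent variable, the a and b that minimize the sum of squared deviations have a well-known closed form: b is the sum of cross-products of deviations divided by the sum of squared x deviations, and a is the mean of y minus b times the mean of x. A minimal sketch follows. The function name fit_line and the sample points are ours; the points were chosen to lie exactly on y = 2 + 0.5 x so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and gradient b for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared x deviations and sum of cross-products of deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # gradient
    a = mean_y - b * mean_x   # intercept: the line passes through the two means
    return a, b

# Illustrative points that lie exactly on y = 2 + 0.5 x
xs = [0, 2, 4, 6]
ys = [2, 3, 4, 5]
a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

With real data the points will not lie exactly on a line, and the same function returns the line with the smallest possible sum of squared deviations.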

Figure 2

Goodness of fit in regression

Having found the best straight line, the next question is how well it describes the data.

We answer it by asking what proportion of the variance in the y values the regression line accounts for. This is called the variance accounted for, symbolized by VAC or R2. Its square root, given the same sign as b, is the Pearson product-moment correlation coefficient (r). R2 can vary from 0 (the points show no linear relationship at all) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). Two sets of data can have identical a and b values and very different R2 values, or vice versa.
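To see how identical a and b values can go with very different R2 values, the sketch below builds two made-up data sets around the same line y = 2 + 0.5 x. The scatter pattern is chosen so that it sums to zero and is uncorrelated with x, which leaves the fitted line unchanged while the spread around it grows. All names and numbers here are ours, for illustration only.

```python
def fit_with_r2(xs, ys):
    """Return (a, b, R2) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Variation left around the line, and total variation around the mean
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - sse / sst

xs = [1, 2, 3, 4]
pattern = [1, -1, -1, 1]  # sums to zero and is uncorrelated with xs

def scattered(d):
    """y values on the line y = 2 + 0.5 x, pushed off it by d times the pattern."""
    return [2 + 0.5 * x + d * e for x, e in zip(xs, pattern)]

a_tight, b_tight, r2_tight = fit_with_r2(xs, scattered(0.1))
a_loose, b_loose, r2_loose = fit_with_r2(xs, scattered(2.0))
print("small scatter:", a_tight, b_tight, r2_tight)
print("large scatter:", a_loose, b_loose, r2_loose)
```

Both fits return essentially the same intercept and gradient, but R2 is close to 1 for the small scatter and close to 0 for the large scatter.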

Note carefully that a, b and R2 are all descriptive statistics. We have not said anything yet about significance tests. Given a set of paired x and y values, we can use Minitab to find the corresponding values of a, b and R2. It will also do some significance tests for us.

Problem: Suppose a vitamin and supplement supplier would like to investigate the relationship between the size of an order and the age of the customer who placed it. The information will be used to target promotions to specific age groups. The following table shows the ages of seven randomly selected customers along with their most recent order sizes in dollars.

Age (in years)    Order Size ($)
41                54
26                30
34                22
54                63
29                15
49                25
38                85

Based on this information, (i) find the equation of the regression line using the least-squares technique and interpret the result; (ii) determine whether the slope of the regression line is significant at the 95% confidence level; (iii) compute the correlation coefficient and the coefficient of determination.

You could try to solve this problem and if you have any questions, please send me an e-mail.
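If you would like to check your hand calculations, the three parts can be sketched in a few lines of Python. This is only one possible approach, not the required method. It assumes the table is read row by row as (age, order size) pairs, uses a two-sided t test on the slope with n - 2 degrees of freedom, and takes the critical value 2.571 (5 degrees of freedom, 5% two-sided) from a standard t table. All variable names are ours.

```python
import math

# Data from the problem, read as (age, order size) pairs
ages = [41, 26, 34, 54, 29, 49, 38]
orders = [54, 30, 22, 63, 15, 25, 85]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(orders) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
syy = sum((y - mean_y) ** 2 for y in orders)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, orders))

# (i) Least-squares regression line: order size = a + b * age
b = sxy / sxx
a = mean_y - b * mean_x

# (ii) t statistic for the slope, tested against zero
sse = syy - b * sxy                       # variation left around the line
se_b = math.sqrt(sse / (n - 2) / sxx)     # standard error of the slope
t_stat = b / se_b
T_CRIT = 2.571  # assumed from a standard t table: df = 5, two-sided 5% level
slope_is_significant = abs(t_stat) > T_CRIT

# (iii) Correlation coefficient and coefficient of determination
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print("a =", round(a, 4), "b =", round(b, 4))
print("t =", round(t_stat, 4), "significant:", slope_is_significant)
print("r =", round(r, 4), "R2 =", round(r_squared, 4))
```

Minitab reports the same quantities in its regression output, so the two sets of results can be compared directly.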
