Regression analysis can be defined as the process of developing a mathematical model that can be used to predict one variable by using another variable or variables. This section first covers the key concepts of two common approaches to data analysis: graphical data analysis and correlation analysis and then introduces the two main types of regression: linear regression and non-linear regression. The section also introduces a number of data transformations and explains how these can be used in regression analysis.
When you have worked through this section, you should be able to:
• Distinguish between a dependent variable and an independent variable and analyse data using graphical means.
• Examine possible relationships between two variables using graphical analysis and correlation analysis.
• Develop simple linear regression models and use them as a forecasting tool.
• Understand polynomial functions and use non-linear regression as a forecasting tool.
• Appreciate the importance of data transformations in regression modelling.
Advertising cost example
It is well known that some form of advertising for a particular product will be associated with and have an effect on its sales. Numerical data has been collected from ten companies on their monthly volume of sales of a particular product as well as their cost of advertising for that product. This data is shown in table 2.1. We want to develop an appropriate regression model that will be based on this data and could be used to predict the volume of sales for a particular company, given that company’s advertising cost.
Table 2.1 Advertising cost data
Company Sales Advertising cost (£)
A 25000 800
B 35000 1200
C 29000 1100
D 24000 500
E 38000 1400
F 12000 300
G 18000 600
H 27000 800
I 17000 400
J 30000 900
In this example we have two variables, sales and advertising cost, and numerical data has been collected from a number of companies. The first thing to note is the distinction between this data set and the entire population of companies. What we have here is just a small sample taken from the entire population of companies selling the particular product.
The idea is to use the data from the given sample in order to develop a regression model that could then be used to predict the volume of sales for a particular company based on that company’s expenditure on advertising.
The regression model to be developed will relate the volume of sales to advertising cost. As we expect the volume of sales to depend on the cost of advertising, we take sales to be the dependent variable and advertising cost to be the independent variable.
Before we start developing the regression model we should first make sure that a relationship exists between advertising cost and the volume of sales. If such a relationship does not exist between the two variables, then there is no point in developing a regression model. Although a regression model could still be developed easily, that model wouldn’t be able to produce accurate forecasts and therefore make any significant contribution to decision making.
The relationship between two variables can be tested graphically using a scatter diagram or statistically using correlation analysis. The results from the analysis of data will tell us whether to use regression analysis as the forecasting tool and what type of regression model to develop.
The relationship between two variables can be examined graphically using a scatter diagram. A scatter diagram is a simple two-dimensional graph of the values of the dependent variable and the independent variable.
The important thing to remember when drawing a scatter diagram is that the dependent variable is always drawn on the vertical axis of the diagram and that the independent variable is always drawn on the horizontal axis of the diagram. The dependent variable is usually represented by Y and the independent variable is usually represented by X. This is the notation to be used throughout the course.
The following scatter diagram shows the volume of sales against advertising cost for the advertising cost data shown in table 2.1. The last part of this section explains how this diagram was produced on Excel.
Graph 2.1 Sales vs Advertising cost
Looking at the above diagram we can see that high volumes of sales are associated with high advertising costs and that low volumes of sales are associated with low advertising costs. In other words, a relationship exists between the two variables, with the volume of sales increasing as the advertising cost goes up. As this increase is linear (i.e. the value of Y increases with the value of X in a linear way), the relationship between the two variables is a linear relationship.
Now consider the scatter diagrams shown in graphs 2.2 and 2.3.
Graph 2.2 Perfect positive linear relationship
Graph 2.2 indicates a perfect linear relationship between the two variables, as all the data points on the graph have fallen exactly on a straight line. The relationship is also positive as the value of Y increases as the value of X goes up.
Graph 2.3 Perfect negative linear relationship
Graph 2.3 indicates another perfect linear relationship between the two variables, as all the data points on the graph have again fallen exactly on a straight line. This time, however, the relationship is negative as the value of Y decreases as the value of X goes up.
Finally, graph 2.4 shows a case of no relationship between the two variables. In such a case regression analysis would fail to produce accurate forecasts and therefore to make a contribution to decision making.
Graph 2.4 No relationship
Correlation analysis measures the degree of linear association between two variables.
The method is based on the following relation:
n?XY – ?X?Y
r = ————————————— (2.1)
There are a number of slightly different versions of the above formula but they all give the same result. Remember that X is the independent variable and Y is the dependent variable. ? is the summasion symbol. In other words, ?X is the sum of all the values of the independent variable and ?X2 is the sum of all the squared values of the independent variable. n is the number of observations (the number of data points in the sample).
Application of the above formula will produce the value of r. r is known as the correlation coefficient and its value determines the strength and direction of linear association between the two variables under examination. In other words, the value of r will tell us whether there is a relationship between the two variables and how strong that relationship is. If there is a relationship, then the value of r will also indicate whether the value of the dependent variable increases or decreases as the value of the independent variable goes up.
The value of r lies between –1 and 1, with –1 indicating a perfect negative linear relationship and 1 indicating a perfect positive linear relationship. A value around zero indicates no linear relationship between the two variables. In other words, values of the correlation coefficient near –1 or 1 indicate a strong correlation between the two variables, whereas values of the correlation coefficient near zero indicate no correlation between the two variables.
Now let’s use correlation analysis to examine the strength and direction of the relationship between sales and advertising cost in the advertising cost example used earlier on in this section. Table 2.1 is re-printed below. As the figures involved are very large, we have divided all sales figures by 1000 and all advertising cost figures by 100. This is a useful way to avoid dealing with calculations involving very large numbers. Although, Excel can handle data of any size, the idea here is to keep the complexity of the calculations low so that it is clear how the method works.
Table 2.2 Advertising cost data
Company Sales (000s) Advertising cost (00s)
A 25 8
B 35 12
C 29 11
D 24 5
E 38 14
F 12 3
G 18 6
H 27 8
I 17 4
J 30 9
The following table shows how the calculations have been carried out. Pay particular attention to the difference between ?X2 and (?X)2 (which also applies to the dependent variable).
Table 2.3 Sales vs Advertising cost correlation calculations
Y X Y2 X2 XY
25 8 625 64 200
35 12 1225 144 420
29 11 841 121 319
24 5 576 25 120
38 14 1444 196 532
12 3 144 9 36
18 6 324 36 108
27 8 729 64 216
17 4 289 16 68
30 9 900 81 270
255 80 7097 756 2289
n = 10 ?X2 = 756
?X = 80 ?Y2 = 7097
?Y = 255 (?X)2 = (80)2 = 6400
?XY = 2289 (?Y)2 = (255)2 = 65025
Substituting the above results into relation 2.1 we will get:
10 x 2289 – (80 x 255)
r = ——————————————————
(10 x 756 – (80)2) x (10 x 7097 – (255)2
22890 – 20400
(7560 – 6400) x (70970 – 65025)
= ———————- = ———–
1160 x 5945 2626.06
Note that Excel has a very useful function which can automatically calculate the value of r. This together with a number of other relevant functions are listed in the last part of this section.
The value of the correlation coefficient has been found to be 0.95 (rounded up to two decimal places) indicating that there is a strong positive correlation between advertising cost and the volume of sales. This confirms the findings from the scatter diagram (graph 2.1), which indicated that the volume of sales increases linearly with advertising cost.
Always keep in mind that a low correlation coefficient value does not necessarily mean that there is no relationship between the two variables. All it says is that there is no linear relationship between the variables – there may be a strong relationship but of a non-linear kind (this will be discussed further later in this section).
Developing a linear regression model
If we look at the advertising cost example we can see that both the scatter diagram (graph 2.1) and the value of the correlation coefficient (0.95) indicate that a strong linear relationship exists between advertising cost and sales. We can then use linear regression to describe that linear relationship.
The following graph shows the volume of sales against advertising cost with a straight line fitted on it. This line is called regression line and it is the result of using regression analysis. We shall now describe the process that produced this line.
Graph 2.5 Regression line for advertising cost example
As you can see from the above diagram, regression has fitted a straight line on the data. In fact that regression line has been fitted in such a way so that the sum of the distances between the data points and the line (i.e. the gaps between the data points and the line) is minimised. Because of this, the regression line is also known as the line of best fit.
Regression therefore aims to fit a line through the data in order to describe the relationship between two variables. If the relationship between the two variables is linear (like the one in this example), then a straight line is fitted through the data and the data points will lie very close to that line.
Obviously, we could visually draw a straight line through the data points of the scatter diagram in an attempt to fit the line to the points as closely as possible. The problem with this approach however is that, no matter how good our fit is, one could come up with a better fit. What we should do instead is to fit the regression line using a more statistical approach, which is known as the least squares regression method.
According to the least squares regression method, a regression line is fitted through the data in such a way so that the sum of the squares of the distances between the data points and the line is minimised. The resulting regression line could be straight or curved depending on the type of the relationship between the two variables.
A linear regression model is based on the linear function shown in relation 2.2.
Predicted Y = b0 + b1X (2.2)
The parameter b0 is called the intercept and the parameter b1 is called the slope of the regression line. The value of the intercept determines the point where the regression line meets the Y axis of the graph. The value of the slope represents the amount of change in Y when X increases by one unit.
Another name frequently used in regression analysis to refer to the independent variable is predictor. Note that the above regression model uses only one predictor and is therefore called a simple regression model. A model which uses more than one predictor is called a multiple regression model. Relation 2.3 shows the general form of a multiple regression model with k predictors.
Predicted Y = b0 + b1X1 + b2X2 + … + bkXk (2.3)
The regression line which appears on graph 2.5 has been produced on Excel (the way that this is done on Excel is explained in the last part of this section). Note here that Excel displays the regression equation in the form Y = b1X + b0. In other words, the order in which the values of the intercept and the slope appear is different to the one shown in relation 2.2, and Y actually refers to the predicted value of the response variable. To avoid any confusion, in this course we will use the regression equation exactly as it is shown in relation 2.2 but we will leave the equations on the graphs exactly as Excel displays them.
In order to develop a linear regression model of the form Predicted Y = b0 + b1X we need to calculate the values of b0 and b1. These values are given by the following relations:
n?XY – ?X?Y
b1 = ——————— (2.4)
b0 = —- – b1 —– (2.5)
Note that some textbooks use slightly different formulae to calculate the values of b0 and b1. This often happens with different textbooks but all the formulae used are mathematically equivalent to those shown above and they will give exactly the same results.
Application of the above formulae to the advertising cost data will produce the following results:
10 x 2289 – (80 x 255)
b1 = —————————–
(10 x 756-(80)2)
22890 – 20400
(7560 – 6400)
b0 = —— – 2.1466 ——
= 25.5 – 2.1466 x 8
Substituting the above values in relation 2.2 will give us the following regression equation:
Predicted Y = 8.3272 + 2.1466X
The value of b0 is 8.3272, which means that the regression line cuts the vertical axis of the graph at that point. Similarly, the value of b1 is 2.1466 indicating that the value of Y will increase by 2.1466 every time that the value of X increases by 1 (obviously, when X=0, Y=8.3272).
Excel can again calculate the values of b0 and b1 very easily and the steps are shown in the last part of this section.
Using the regression model to make predictions
Once a regression model has been developed it can then be used to predict the volume of sales for a company based on its advertising cost.
Suppose that we want to predict the volume of sales for a company which has spent £1000 on advertising. All we need to do is take this to be the value of X in the regression model and then calculate the corresponding value of Y. Note however that, as we have divided all advertising cost figures by 100 in order to make the figures more manageable, we also need to do the same with the new figure. Therefore, the value to be substituted in the regression model should be 10 (rather than 1000). The predicted volume of sales can therefore be calculated as follows:
Predicted Y = 8.3272 + 2.1466 (10) = 29.7932
The above result is the predicted value of sales for a company which has spent £1000 on advertising. Note that this figure should now be multiplied by 1000 in order to be converted back to the same units as the original data (this is again because we have divided all sales figures by 1000 in order to make the figures more manageable). Thus, a company which spends £1000 on advertising for a particular product is expected to sell 29,793 units of that product.
Note that to make this prediction we used an X value (1000) from the existing range of values of the X variable (300-1200). In general, it is too risky to attempt to predict a value of Y using an X value which is outside the range of X values of the data collected. That is because the linear relationship that exists between the two variables only covers the existing data and this could change if another range of values was considered.
Also note that the above prediction is based on the regression model, which is itself based on the data obtained from the ten companies. In other words, the regression model and any forecasts produced by that model are all based on sample data. Had a different sample been used, the regression model produced would have been different. This will be discussed further in the next section.
So far we have looked at how regression analysis could be used in situations where a linear relationship exists between two variables. However, there are situations where the two variables might be related in a non-linear way. In other words, although the results from correlation analysis have shown that a relationship does not exist between the two variables, these variables might still be closely related (don’t forget that correlation analysis measures the strength of linear association between the two variables). The forecaster should therefore make sure that the data is always graphed during the data analysis stage. The resulting graph will help the forecaster identify any non-linear patterns that correlation analysis has failed to spot.
If the results from data analysis show that there is a non-linear (also known as curvilinear) association between the two variables, then there is no point in developing a linear regression model. Although a linear regression model could be developed very easily, such a model would fail to produce a good fit and therefore generate accurate forecasts.
We can handle curved data in two ways: by using a polynomial rather than linear regression model, or by transforming the data and then using a linear regression model. The two methods are covered in the rest of this section.