QUANT homework

SimpleRegressioninExcel.xlsx

Home >Information Systems homework help >QUANT homework

Introduction

Ideally the goal is to select explanatory variables to build the simplest regression equation possible that will produce a good estimate of the response variable (parsimony)
		this to simplify interpretation of the regression coefficients
		selecting good variables for use in a regression analysis usually involves initially engaging in some research
To use linear simple linear regression analysis: There needs to be a linear relation between:
	1) the response variable and the explanatory variable
Investigate the linear relations by examining:
	1) Correlation: the numerical measure of strength in the relation between the explanatory and response variables
	2) scatter plot of the response variable against the independent variable
		the plot can show linear dot pattern , indicating likely need for the explanatory variable to be included in the regression equation
		the plot can show non-linear dot pattern suggesting a transformation of an explanatory variable may be needed
		the plot can show a horizontal dot pattern suggesting the variable should not be used in the regression equation
	3) residual plots
		looking for a scattered, cloudy dot pattern with consistent variance at each level of the explanatory variable to help confirm linear relation likely exists
			if a pattern is observed, this suggests transformations many need to be applied

correlation

	Response variables (variable of interest) : typically placed on y-axis
	explanatory variables: typically placed on x-axis
	Choosing the role of the variables depends on how you think the variables are related
	Correlation coefficient (r): permits you to quantify the strength of the linear relation between the 2 variables
		In Excel use =CORREL feature
properties of correlation coefficient
	range from -1 to 1
	sign gives the direction of the association
	unitless measure
	not affected by changes to the center or scale of the variables
	depends only on the z-scores
	sensative to outliers
	does not imply causation (could be lurking variables standing behind the association)
	check the scatter plot for straightness even if the value of r is high

scatter

Scatterplots

displays relation (association) between 2 quantitative variables

investigor can see patterns, trends, and outliers

1. Direction of the association

negative

positive

2. Form of the association

nonlinear vs linear

3. Strength of association

How discernable is the data pattern or trend?

weak, moderate, strong

residual

Question: What line can be used to model the linear data pattern displayed in the scatterplot?

the equation of the best fit line is the regression line (model)

the estimates for the response variable made through the regression line are called the predicted values

residuals

To find residuals, subtract the predicted value (calculated with the regression equation) from the observed value

negative residual: the regression equation is overestimating

postive residual: the regression equation is underestimating

Best fitting line means the sum of its least squared residuals is smallest

Example1A

Question: Is there a linear relation between the response variable (salary) and housing price?
		1) correlation (CORELL feature in Excel)
		2) scatter plot
		3) residual plot
			1) Correlation		0.9086846363	3) Residual Plot (Run the regression in Excel, using the Data Analysis Toolbar. Select Residual plot
				A correlation of approximately .91 implies that salary and housing price have
				a strong, positive, linear relation where salary increase ad housing price increases in a systematically predictable way
	Salary($10,000)	Housing Price ($10,000)
	1	45	2) Scatter Plot (select variables and use the Insert .. Charts menu)
	2	47
	3	57
	4.5	59
	5.2	60
	8.4	65
	11	66
	12.4	67
	12.5	69
	13.1	70
	15	70
	17	72
	22.2	73
	22.6	74				There is a pattern (dot pattern is not scattered) evident in the residual plot. This suggests that some other relation, beyond a linear relation
	25.8	76				would likely better represent the relation between salary and housing price. Hence a transformation should be
						performed
			From the scatter plot is can be seen that the relation appears to be positive and strong. However there also
			appears to be a slight bend in the dot pattern. Further investigation is needed in order to determine if the
			trend is linear

Housing Price ($10,000) 1 2 3 4.5 5.2 8.4 11 12.4 12.5 13.1 15 17 22.2 22.6 25.8 45 47 57 59 60 65 66 67 69 70 70 72 73 74 76

Salary

Housing Price ($10,000)

Transform1

Goal: Make the dot pattern of a scatterplot more nearly linear
	calculate transform for the explanatory and the response variables
	look at scatterplots of combinations of transformed and/or non-transformed values
	Can use Adjusted R-Squared to help chose between candidate linear equation models that result
		Non-linear dot patterns in scatterplots and possible transformations to straighten
year	salary (in $10,000)	LN(salary)
1980	1
1982	2
1990	3
1990	4.5
1991	5.2
1996	8.4
1997	11
1998	12.4
1999	12.5
1999	13.1
1999	15
2001	17
2001	22.2
2004	22.6
2005	25.8

(LN(x), Y) (x, y2) (LN(y), LN(x))

(LN(x), Y) (x, 1/y)

(x, SQRT(y)) (x, LN(y)) (LN(y), LN(x))

Significance

What else to check in the Regression Output
The overall significance of the regression equation (in the ANOVA table)
The regression coefficent is the average change in y expected per unit change in the explanatory variable when all other explanatory variables are held constant
How significant are the regression equation and/or the coefficients?
	less than .05 for the significance and/or p-values , 1 or more of the explanatory variables are good predictors of the response variable
	also, check R-squared (use adjusted R-squared)
Standard Approach we will use (1 of the following will be the investigation conclusion)
If regression equation is not significant, the model is not a good fit to the data
If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction
If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction
	Coefficient significance	Model significance

Example2

Is the regression model with the Square Root Transformed response variable salary the better model? Yes according to the analysis (scatter plots, residual plots, adjusted r-squared, and standard error comparisons)
year (actual)	year (Explanatory)	salary (in $10,000) (Response)	year (Explanatory)	SQRT(Salary)			Compare Scatter Plots : Which looks visually straighter ?
1980	80	1	80	1
1982	82	2	82	1.4142135624
1990	90	3	90	1.7320508076
1990	90	4.5	90	2.1213203436
1991	91	5.2	91	2.2803508502
1996	96	8.4	96	2.8982753492
1997	97	11	97	3.3166247904
1998	98	12.4	98	3.5213633723
1999	99	12.5	99	3.5355339059
1999	99	13.1	99	3.6193922142
1999	99	15	99	3.8729833462
2001	101	17	101	4.1231056256
2001	101	22.2	101	4.7116875958
2004	104	22.6	104	4.7539457296
2005	105	25.8	105	5.0793700397			Compare Residual Plots : Which has a cloudy, scattered dot pattern ?
								Response: Salary							Response : Square Root of Salary
					Compare Adjusted R-squared values : Which is largest ? Compare Standard Error: Which is smallest?
							Both models are significant
							The regression model with the transformed response variable has the largest Adjusted R-squared value and smallest standard error
					Response Variable : Salary									Response Variable : Square Root Of Salary
					Regression Statistics											Regression Statistics
					Multiple R	0.9150264296										Multiple R	0.9668776273
					R Square	0.8372733668										R Square	0.9348523462
					Adjusted R Square	0.8247559335										Adjusted R Square	0.9298409882
					Standard Error	3.28038926										Standard Error	0.3337444496
					Observations	15										Observations	15
					ANOVA											ANOVA
						df	SS	MS	F	Significance F							df	SS	MS	F	Significance F
					Regression	1	719.7849352684	719.7849352684	66.8885821383	0.0000017547						Regression	1	20.7785720521	20.7785720521	186.5467103404	0.0000000044
					Residual	13	139.892398065	10.7609536973								Residual	13	1.4480096496	0.1113853577
					Total	14	859.6773333333									Total	14	22.2265817017
						Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%				Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
					Intercept	-80.4842432619	11.3048781078	-7.1194260119	0.0000078229	-104.906947591	-56.0615389328	-104.906947591	-56.0615389328			Intercept	-12.4661593835	1.1501501874	-10.8387230813	0.0000000701	-14.9509077987	-9.9814109684	-14.9509077987	-9.9814109684
					year (Explanatory)	0.9657567381	0.1180841892	8.1785440109	0.0000017547	0.710651357	1.2208621192	0.710651357	1.2208621192			year (Explanatory)	0.164087017	0.0120138007	13.6582103637	0.0000000044	0.1381327785	0.1900412554	0.1381327785	0.1900412554

salary (in $10,000) (Response) 80 82 90 90 91 96 97 98 99 99 99 101 101 104 105 1 2 3 4.5 5.2 8.4 11 12.4 12.5 13.1 15 17 22.2 22.6 25.8

SQRT(Salary) 80 82 90 90 91 96 97 98 99 99 99 101 101 104 105 1 1.4142135623730951 1.7320508075688772 2.1213203435596424 2.2803508501982761 2.8982753492378879 3.3166247903553998 3.5213633723318019 3.5355339059327378 3.6193922141707713 3.872983346207417 4.1231056256176606 4.7116875957558984 4.7539457296018854 5.0793700396801178

regression equation after trans

To undo the LN transformation to determine a regression equation for the original salary units, take the exponential (e) of both sides of the equation

To Undo the LN transformations (x, LN(Y)) LN(Y) = ax + b y = e(ax + b) y = eax (eb) (LN(x), y) y = aLN(x) + b (LN(x), LN(y)) LN(y) = aLN(x) + b LN(y) = aLN(x) + b LN(y) = LN(xa) + b y = xa + eb

numerical summaries

Numerical Summaries: Excel Regression Output Table (color coded)
Regression Coefficients			Excel Regression Output Table
	y-intercept: average value of the response variable when all explanatory variables are 0		Regression Statistics
		impact of explanatory variables are not considered	Multiple R	0.9150264296
			R Square	0.8372733668
regression coefficients: the average predicted change in the response variable per unit change in the explanatory variable			Adjusted R Square	0.8247559335
			Standard Error	3.28038926
Correlation Coefficient: Strength of Linear Relation			Observations	15
Coefficient of Determination (r^2) - Use Adjusted R^2			ANOVA
	fraction of the variation in the data accounted for by the regression equation			df	SS	MS	F	Significance F
	sometimes reported as a percentage		Regression	1	719.7849352684	719.7849352684	66.8885821383	0.0000017547
	between 0 and 1		Residual	13	139.892398065	10.7609536973
			Total	14	859.6773333333
SEE (Standard Error of the Estimate)
	standard deviation of the residuals			Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
	how spread out the observations are from the regression line		Intercept	-80.4842432619	11.3048781078	-7.1194260119	0.0000078229	-104.906947591	-56.0615389328	-104.906947591	-56.0615389328
			year (Explanatory)	0.9657567381	0.1180841892	8.1785440109	0.0000017547	0.710651357	1.2208621192	0.710651357	1.2208621192
Confidence Intervals : Range, withing specified level of confidence, that the population parameter resides

Influential Points

Does the regression equation change significantly when the isolated points is removed?

Would a different result from the analysis be concluded?

If yes, then the point is influential

In the regression output look at:

significance of model or coefficients

numerical summaries

Example

with influencial point

without influential point

recode

Recode multiple category variables (3 or more categories) to 2 category variable
Gender: male (1) female (0)
Political Party (0) republican, (1) democrat, (2) libertarian, (3) green
	Student	IQ	Study hours	Gender	Political Party	Test score
	1	110	40	1	0	100
	2	110	40	0	2	95
	3	120	30	1	1	90
	4	110	40	1	3	85
	5	100	20	0	3	80
	6	110	40	1	1	75
	7	90	0	0	0	70
	8	110	40	0	2	65
	9	80	10	1	3	60
	10	80	10	0	1	55
The dummy variables are 1 when (zero otherwise):
x1: republican
x2: democrat
x3: libertarian
	Student	IQ	Study hours	Gender	x1	x2	x3	Test score
	1	110	40	1	1	0	0	100
	2	110	40	0	0	0	1	95
	3	120	30	1	0	1	0	90
	4	110	40	1	0	0	0	85
	5	100	20	0				80
	6	110	40	1				75
	7	90	0	0				70
	8	110	40	0				65
	9	80	10	1				60
	10	80	10	0				55