Statistical Analysis Subject
SOUTHERN CROSS UNIVERSITY, School of Business and Tourism, MAT10251 Statistical Analysis
PROJECT COVER SHEET
Please complete all of the following details and then make these sheets the first pages of your project – do not send it as a separate document.
Your project must be submitted as a Word document.
PART C
| Student Name: | Ashish Neupane |
| Student ID No.: | 22754842 |
| Tutor's name: | Dr Badri P. Bhattrai |
| Due date: | 19th January, 2018 |
| Date submitted: | 23rd January, 2018 |
Declaration:
I have read and understand the Rules Relating to Awards (Rule 3 Section 18 – Academic Misconduct Including Plagiarism) as contained in the SCU Policy Library. I understand the penalties that apply for plagiarism and agree to be bound by these rules. The work I am submitting electronically is entirely my own work.
| Signed: (please type your name) | Ashish Neupane |
| Date: | 23rd January, 2018 |
STUDENT NAME: Ashish Neupane
STUDENT ID NUMBER: 22754842
MAT10251 – Statistical Analysis
Project Part C
Complete the summary table below.
| Sample Number (last digit of your student ID number) | 02 |
| Level of Significance | 0.05 |
Value: 20%
PLEASE ENSURE YOU KEEP A COPY OF YOUR PROJECT
|
Marking and Feedback Sheet
Comments
Written Answer Part C
Delete the italic text and add your content.
Each answer below should:
· Introduce and put the question in context
· Include appropriate Excel output.
· Present the results of your procedures, intervals or tests without unnecessary statistical jargon
Question 1
100 to 200 words and 1 to 2 pages
Use your answer to Question 1:
Is there a difference in the average price of cars, of the specified make and model for sale in the specified state, for sale privately and by a used car dealer?
to provide a justified answer to your relative’s question. That is, is there a difference in price between cars sold by a dealer and those sold privately?
Questions 2 and 3
200 to 500 words and 2 to 4 pages
Use the simple and multiple linear regression models developed in Questions 2 and 3 to provide a linear model to predict the price of a used car from age and/or transmission type and/or odometer reading, to answer your relative’s question. That is, how the value of the car they purchase will depreciate?
· Explain choice of independent and dependent variables.
· Include your scatter plot and discuss any apparent relationship between price and age. Comment on the strength, shape and sign of the relationship.
· Include and justify the simple or multiple linear model which best fits the data.
· Discuss and interpret the values of the regression and correlation coefficients of the best model.
· Present the results without unnecessary statistical jargon.
· Provide an answer to your relative’s question. That is, how the value of the car that they purchase will depreciate?
As these answers are part of a letter or emails to a relative you can use informal or casual written language
Appendices Part C
Delete the italic text and add your content.
This section should include appropriate graphs, Excel output and any necessary steps for the required statistical tasks.
Tests should show full statistical working including
· Random variable/s defined
· Any required assumptions mentioned
· Statistical calculations, including Excel output
· Hypotheses and decision for tests
· Conclusion for any hypothesis test.
Appendix C.1 Statistical answer for Question 1
Is there a difference in the average price of cars, of the specified make and model for sale in the specified state, for sale privately and by a used car dealer?
Appendix C.2 Statistical answer for Question 2 and Question 3
Assumptions and Variables Defined
Define dependent and independent variables for both simple and multiple linear regression models.
Mention any assumptions required for the simple/multiple linear regression models.
Simple Linear Regression Model
· Develop a simple linear regression model
· Include interpretation of regression and correlation coefficients.
Multiple Linear Regression Model
· Develop a multiple linear regression model with three independent variables
· Include interpretation of multiple regression and correlation coefficients for the multiple regression model
· Determine which independent variables make a significant contribution to the regression model.
· State, and justify, the simple or multiple linear model which best fits the data.
Letter
Mr Ram Prasad Kuikel
38 Robinson Street
Riverstone NSW
19th January 2018
Dear Ramu,
I am writing to answer the question you asked about the difference in price between cars for sale privately and cars for sale through a used car dealer. This letter should also help you understand how the price of a car depreciates with its age, odometer reading and transmission type.
When buying a car it is important to know how prices differ between cars of the same make and model. To help with your choice, I have looked at a sample of cars currently for sale in the state.
The boxplot diagram below shows the price of cars sold privately and by used car dealers.
No 1.
| Z Test for Differences in Two Means | |
| Data | |
| Hypothesized Difference | 0 |
| Level of Significance | 0.05 |
| Population 1 Sample | |
| Sample Size | 89 |
| Sample Mean | 15773.61798 |
| Sample Standard Deviation | 4840.665989 |
| Population 2 Sample | |
| Sample Size | 32 |
| Sample Mean | 11741.1875 |
| Sample Standard Deviation | 4132.159152 |
| Intermediate Calculations | |
| Difference in Sample Means | 4032.430478 |
| Standard Error of the Difference in Means | 892.6741 |
| Z Test Statistic | 4.5172 |
| Two-Tail Test | |
| Lower Critical Value | -1.9600 |
| Upper Critical Value | 1.9600 |
| p-Value | 0.00001 |
| Reject the null hypothesis | |
| Upper-Tail Test | |
| Upper Critical Value | 1.644853627 |
| p-Value | 0.000003 |
| Reject the null hypothesis | |
| Lower-Tail Test | |
| Lower Critical Value | -1.644853627 |
| p-Value | 0.999997 |
| Do not reject the null hypothesis | |
| Confidence Interval Estimate for the Difference Between Two Means | |
| Data | |
| Confidence Level | 95.00% |
| Intermediate Calculations | |
| Z Value | 1.9600 |
| Interval Half Width | 1749.609066 |
| Confidence Interval | |
| Interval Lower Limit | 2282.821411 |
| Interval Upper Limit | 5782.0395 |
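Using the p-value approach, the p-value of 0.00001 is less than the level of significance (α = 0.05), so we reject the null hypothesis that there is no difference in price. If there were really no difference between the price of cars for sale privately and by a used car dealer, the probability of getting a difference as large as the one in our sample would be about 0.00001, which is a very unlikely event. Therefore, my sample provides evidence of a real difference in the average price of cars of the same make and model in the specified state for sale privately and by a used car dealer. In the sample, the average prices of the two groups differ by about $4,032, and I am 95% confident that the true difference is between about $2,283 and $5,782.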
The age of a car influences its price, so I would expect the price to depend on the age of the car. I am therefore constructing a linear model with price as the dependent variable and age as the independent variable, so that the price can be predicted from the age of the car.
The scatter diagram below shows the relationship between the age and the price of used cars. As expected, in the sample the price is higher for newer cars. The relationship is negative and approximately linear.
| SUMMARY OUTPUT | |
| Regression Statistics | |
| Multiple R | 0.877278 |
| R Square | 0.769616 |
| Adjusted R Square | 0.76768 |
| Standard Error | 2399.539 |
| Observations | 121 |
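As expected from the scatter plot, the correlation coefficient is r = -0.877 (Excel reports Multiple R = 0.877 without a sign; the sign comes from the negative slope), showing that there is a strong negative linear relationship between the price and the age of cars.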
| SUMMARY OUTPUT | |
| Regression Statistics | |
| Multiple R | 0.877278 |
| R Square | 0.769616 |
| Adjusted R Square | 0.76768 |
| Standard Error | 2399.539 |
| Observations | 121 |
| ANOVA | | | | | |
| | df | SS | MS | F | Significance F |
| Regression | 1 | 2.29E+09 | 2.29E+09 | 397.5287 | 9.66E-40 |
| Residual | 119 | 6.85E+08 | 5757789 | | |
| Total | 120 | 2.97E+09 | | | |
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
| Intercept | 20986.63 | 383.1138 | 54.77909 | 2.93E-86 | 20228.02 | 21745.23 | 20228.02 | 21745.23 |
| Age | -1274.85 | 63.94043 | -19.9381 | 9.66E-40 | -1401.46 | -1148.24 | -1401.46 | -1148.24 |
In the multiple regression model, the p-values of two of the independent variables, age and odometer, are less than 0.05. This indicates that both age and odometer make a significant contribution to the model and should be included. That is, using age and odometer together gives a stronger model.
The p-value of the third independent variable, transmission, is more than 0.05, so it does not make a significant contribution to the model and should not be included.
The multiple regression model is:
Price = 20986.63 – 997 x age (years) – 0.05 x odometer (km) + 432 x transmission (1 = automatic, 0 = manual)
This will allow you to estimate the price of a car from these three variables. In terms of depreciation, it suggests that the price falls by about $997 for each additional year of age and by about $50 for every additional 1,000 km on the odometer.
However, while the equation above will predict the price of a car, the prediction will not be exact. The coefficient of determination reported above is r2 = 0.7696, so about 77% of the variation in price is explained by the model. That is, other factors will also influence the price of your car.
Please contact me if you want any further information.
Yours sincerely,
Ashish Neupane
Appendix C
Appendix C.1 – Statistical answer for Question 1
Hypothesis test for the difference in means of two independent samples
To answer Question 1
Let,
X = price of a car for private sale
X1 = price of a car for dealer sale
Then, µ = mean price for private sale
µ1 = mean price for dealer sale
X and X1 are independent, with n = 32 and n1 = 89
Choice of test with justification
The boxplots below indicate that the distributions of the prices of private and dealer cars of the specified make and model are approximately normal.
If using a z-test
Since we have large samples, the Central Limit Theorem (CLT) applies. Therefore, the sampling distribution of the difference in sample means is approximately normal, and the z-test for the difference of two independent means can be used to test whether there is a difference in the average price of cars for private and dealer sale.
Both sample sizes are larger than 30 (n = 32 and n1 = 89), so the CLT applies.
Hypotheses
H0: µ1 – µ = 0 (there is no difference in mean price)
H1: µ1 – µ ≠ 0 (there is a difference in mean price)
Use a level of significance of 5%.
Calculation
Excel output of the independent two-tail z-test.
| Data | |
| Hypothesized Difference | 0 |
| Level of Significance | 0.05 |
| Population 1 Sample | |
| Sample Size | 89 |
| Sample Mean | 15773.61798 |
| Sample Standard Deviation | 4840.665989 |
| Population 2 Sample | |
| Sample Size | 32 |
| Sample Mean | 11741.1875 |
| Sample Standard Deviation | 4132.159152 |
| Intermediate Calculations | |
| Difference in Sample Means | 4032.430478 |
| Standard Error of the Difference in Means | 892.6741 |
| Z Test Statistic | 4.5172 |
| Two-Tail Test | |
| Lower Critical Value | -1.9600 |
| Upper Critical Value | 1.9600 |
| p-Value | 0.00001 |
| Reject the null hypothesis | |
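The intermediate calculations in the Excel output can be checked by hand. The following is a minimal sketch, assuming the unpooled standard error sqrt(s1²/n1 + s2²/n2) is what the Excel template uses (the variable names are illustrative and not part of the original workbook). It also reproduces the 95% confidence interval reported in the letter.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the Excel output.
n1, mean1, sd1 = 89, 15773.61798, 4840.665989   # Population 1 sample
n2, mean2, sd2 = 32, 11741.1875, 4132.159152    # Population 2 sample
alpha = 0.05                                    # level of significance

# Difference in sample means and its (unpooled) standard error.
diff = mean1 - mean2
std_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Z test statistic for a hypothesized difference of 0.
z_stat = diff / std_error

# Two-tail p-value and critical value from the standard normal distribution.
std_normal = NormalDist()
p_two_tail = 2 * (1 - std_normal.cdf(abs(z_stat)))
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# 95% confidence interval for the difference between the two means.
half_width = z_crit * std_error
ci_lower, ci_upper = diff - half_width, diff + half_width

reject_h0 = p_two_tail < alpha

print(f"Standard error: {std_error:.4f}")
print(f"Z statistic:    {z_stat:.4f}")
print(f"p-value:        {p_two_tail:.6f}")
print(f"95% CI:         ({ci_lower:.2f}, {ci_upper:.2f})")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Because the p-value is compared with alpha directly, the same script can be rerun with a different level of significance without changing anything else.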
Decision
Since p-value = 0.00001 < 0.05 = α, reject the null hypothesis.
Conclusion
Therefore, the sample provides evidence at the 5% level of significance that there is a difference in the average price of cars for sale privately and by a used car dealer.
Answers for Question 2
Let,
X = car age (independent variable)
Y = car price (dependent variable)
From the scatter plot, the assumption of a linear relationship between price and age appears reasonable.
There is a negative relationship between price and age.
Equation and Coefficients
From the regression output
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
| Intercept | 20986.63 | 383.1138 | 54.77909 | 2.93E-86 | 20228.02 | 21745.23 | 20228.02 | 21745.23 |
| Age | -1274.85 | 63.94043 | -19.9381 | 9.66E-40 | -1401.46 | -1148.24 | -1401.46 | -1148.24 |
| Regression Statistics | |
| Multiple R | 0.877278 |
| R Square | 0.769616 |
| Adjusted R Square | 0.76768 |
| Standard Error | 2399.539 |
| Observations | 121 |
Correlation coefficient
R = 0.877278
R^2 = 0.769616
Regression equation: Price = 20986.63 - 1274.85 x Age
Gradient: b1 = -1274.85 shows that for every additional year of age of a Corolla in Victoria, the sale price is reduced by $1,274.85 on average.
Intercept: b0 = 20986.63
If the age of the Corolla is zero, the predicted price would be $20,986.63.
Interpretation of correlation coefficient
Correlation coefficient: r = -0.877278 (Excel reports Multiple R without a sign; the sign is taken from the negative gradient)
r is close to -1, therefore it shows a strong negative linear relationship between price and age.
Coefficient of determination: r2 = 0.769616 = 76.96%.
This indicates that approximately 76.96% of the variation in price is explained by age.
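The interpretation of the simple linear model can be illustrated with a short calculation. This is a sketch only, using the coefficients from the regression output; the function name predict_price and the example age of 5 years are illustrative assumptions, not part of the original analysis.

```python
import math

# Coefficients and statistics copied from the regression output.
intercept = 20986.63        # b0: predicted price at age 0
slope = -1274.85            # b1: change in price per additional year
slope_std_error = 63.94043  # standard error of b1
r_squared = 0.769616        # coefficient of determination


def predict_price(age_years):
    """Predicted price from the fitted line Price = b0 + b1 * Age."""
    return intercept + slope * age_years


# Excel's "Multiple R" is unsigned, so the sign of r is taken from the slope.
r = math.copysign(math.sqrt(r_squared), slope)

# t statistic for the slope: coefficient divided by its standard error.
t_stat = slope / slope_std_error

print(f"Predicted price at 5 years: {predict_price(5):.2f}")
print(f"Correlation coefficient r:  {r:.6f}")
print(f"t statistic for Age:        {t_stat:.4f}")
```

Predictions are only sensible for ages within the range of the sample data (1 to 16 years); the line should not be extended far beyond that.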
Question No. 3
EXCEL OUTPUT FOR QUESTION NO.3
| Regression Statistics | |
| Multiple R | 0.799615 |
| R Square | 0.639384 |
| Adjusted R Square | 0.636354 |
| Standard Error | 3002.093 |
| Observations | 121 |
| ANOVA | | | | | |
| | df | SS | MS | F | Significance F |
| Regression | 1 | 1.9E+09 | 1.9E+09 | 210.9909 | 4E-28 |
| Residual | 119 | 1.07E+09 | 9012561 | | |
| Total | 120 | 2.97E+09 | | | |
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
| Intercept | 20485.19 | 482.4053 | 42.46469 | 9.67E-74 | 19529.98 | 21440.4 | 19529.98 | 21440.4 |
| Odometer (kms) | -0.08071 | 0.005557 | -14.5255 | 4E-28 | -0.09172 | -0.06971 | -0.09172 | -0.06971 |
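One common way to compare candidate models is the adjusted R Square, which penalises extra independent variables. The sketch below recomputes it from the R Square values in the two single-variable outputs shown in this appendix (price on age, and price on odometer). The helper name adjusted_r_squared is an assumption for illustration; the formula is the standard one.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 121  # observations in both regressions

# R Square values copied from the two Excel outputs.
models = {
    "age": {"r_squared": 0.769616, "n_predictors": 1},
    "odometer": {"r_squared": 0.639384, "n_predictors": 1},
}

adjusted = {
    name: adjusted_r_squared(m["r_squared"], n_obs, m["n_predictors"])
    for name, m in models.items()
}

# The model with the larger adjusted R^2 explains more of the variation.
best_model = max(adjusted, key=adjusted.get)

for name, value in adjusted.items():
    print(f"{name}: adjusted R^2 = {value:.6f}")
print(f"Better single-variable model: {best_model}")
```

A multiple regression with three independent variables could be added to the dictionary with n_predictors set to 3 once its R Square is available.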
Let,
A = age (years), independent variable
O = odometer reading (km), independent variable
T = transmission, independent variable (1 = Automatic, 0 = Manual)
Y = price ($), dependent variable
Price and Age (scatter plot data: 121 cars, ages listed first, then the matching prices in the same order)
Age (years): 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 11 11 12 12 12 13 16 16
Price ($): 16990 16990 16990 17990 17990 19990 15990 15991 16250 16871 16880 16990 16990 16990 16990 17700 17888 17962 17990 18990 18990 18990 20990 21990 23501 13821 15480 16900 16990 17880 18850 18990 12888 13500 14500 14990 15880 16500 16750 17700 20800 21990 19320 19990 19990 20990 21990 22888 24990 24990 25990 25990 28990 18800 19338 19990 14980 15990 15990 15990 16990 17990 13741 14555 14650 14990 16595 16990 11123 11990 13990 14750 9888 13199 13990 7990 8150 10870 11500 11990 12990 13990 15990 12000 12990 9500 10990 11800 12880 11500 11990 9000 9800 9990 10990 10999 11990 12000 6888 8201 8500 9100 9500 9700 10450 10990 11990 7490 7995 9980 10995 8887 8998 5500 7990 4490 6888 5990 5600 5750 6900
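The simple linear regression of price on age can be reproduced directly from the scatter plot data using the least squares formulas b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). This is a minimal sketch in plain Python rather than the Excel procedure used for the project; it assumes the ages and prices are paired in the order they are listed.

```python
import math

# Ages (years) grouped as listed in the data: 121 cars in total.
ages = ([2] * 25 + [3] * 17 + [1] * 14 + [4] * 12 + [5] * 7 + [6] * 10
        + [7] * 6 + [8] * 7 + [9] * 13 + [10] * 2 + [11] * 2 + [12] * 3
        + [13] + [16] * 2)

# Prices ($) in the same order as the ages above.
prices = [
    16990, 16990, 16990, 17990, 17990, 19990, 15990, 15991, 16250, 16871,
    16880, 16990, 16990, 16990, 16990, 17700, 17888, 17962, 17990, 18990,
    18990, 18990, 20990, 21990, 23501,                       # age 2
    13821, 15480, 16900, 16990, 17880, 18850, 18990, 12888, 13500, 14500,
    14990, 15880, 16500, 16750, 17700, 20800, 21990,         # age 3
    19320, 19990, 19990, 20990, 21990, 22888, 24990, 24990, 25990, 25990,
    28990, 18800, 19338, 19990,                              # age 1
    14980, 15990, 15990, 15990, 16990, 17990, 13741, 14555, 14650, 14990,
    16595, 16990,                                            # age 4
    11123, 11990, 13990, 14750, 9888, 13199, 13990,          # age 5
    7990, 8150, 10870, 11500, 11990, 12990, 13990, 15990, 12000, 12990,
    9500, 10990, 11800, 12880, 11500, 11990,                 # ages 6, 7
    9000, 9800, 9990, 10990, 10999, 11990, 12000,            # age 8
    6888, 8201, 8500, 9100, 9500, 9700, 10450, 10990, 11990, 7490, 7995,
    9980, 10995,                                             # age 9
    8887, 8998, 5500, 7990, 4490, 6888, 5990, 5600, 5750, 6900,  # ages 10-16
]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - mean_x) ** 2 for x in ages)
s_yy = sum((y - mean_y) ** 2 for y in prices)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))

slope = s_xy / s_xx                   # b1: change in price per year of age
intercept = mean_y - slope * mean_x   # b0: predicted price at age 0
r = s_xy / math.sqrt(s_xx * s_yy)     # signed correlation coefficient

print(f"n = {n}")
print(f"Price = {intercept:.2f} + ({slope:.2f}) x Age")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

The fitted slope and intercept can be compared with the Age and Intercept coefficients in the Excel regression output as a check on the data entry.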
| | Max Marks | Mark |
| Cover sheet or sample incorrect | -2 | |
| Incorrect format, including file name | -2 | |
| Statistical Inference Question 1 | ||
| Choice of technique, assumptions & other required steps | 5 | |
| Calculation (Excel output) | 3 | |
| Decision and conclusion | 2 | |
| Regression and Correlation | ||
| Assumptions and random variables defined | 2 | |
| Simple Linear Model Question 2 | ||
| Scatter plot | 3 | |
| Equation and coefficients | 2 | |
| Interpretation of regression & correlation coefficients | 2 | |
| Multiple Linear Model Question 3 | ||
| Equation, Coefficients and p-values | 3.5 | |
| Interpretation of regression & correlation coefficients | 3.5 | |
| Statistical Inference | ||
| Choice of technique and other required steps | 2 | |
| Decision and conclusion | 2 | |
| Best model | 1 | |
| Total Statistical Calculations | 31 | 0.0 |
| Written Answer | ||
| Question 1 | ||
| Introduction, discussion and results | 2 | |
| Question 2 & 3 | ||
| Introduction | 1 | |
| Interpretation of scatter plot | 2 | |
| Introduction and discussion of best model | 2 | |
| Structure, grammar and spelling | 2 | |
| Total Written Answer | 9 | 0.0 |
| Total Part C | 40 | 0.0 |