FOUNDATIONS OF DATA SCIENCE

Expectations in the project

· Give an overview of your project and what you want to achieve by the end

· Explain why you are analyzing this dataset and what value the analysis brings (as stated in your proposal)

· Explain the dataset

· State which columns are explanatory variables, and which column is the outcome variable

· Clean, transform, and prepare the data. You do not have to use every column; you may drop some. If you drop any, explain which ones and why.

· Create a statistical summary of your data

· Create visualizations appropriate to each variable's data type. Include at least 4-5 different figures and explain each one.

· Answer the 3-4 research questions that you posed in your proposal. You can do this by examining the statistical summaries, figures, correlations, etc. Depending on the question, regression or classification can also be used to answer it.

· Based on the data type of the outcome variable, perform either regression analysis (numeric outcome) or classification analysis (categorical outcome)

· Split your data into training and testing portions; a 70%-30% split is typically good. Fit your model on the training portion, where the model learns from the data, and then test it on the testing portion.

· Explain how you built the model.

· For example, if it is regression, start with only one variable and add more to assess the improvement. You can use the adjusted R² on the training set to see the improvement, or you can calculate the RMSE on the testing portion.

· In classification, you can choose a single classification model, or more than one and compare their performance. Here you can use the accuracy on the testing portion to assess model performance. Also discuss any parameter tuning. For example, in decision trees, the minimum number of data points a node must contain before it can be split again is a hyperparameter that you need to set and experiment with.

· Discuss the results: were they what you expected, or do they run contrary to your expectations? Provide some insights.

· For classification, once you decide on the best model, create a confusion matrix for that model on the testing set

· What more could you do on this data if you had more time?

· In the last class on 11/7/2019, I did classification on the auto gas mileage data we have been using. Together, the modules used there constitute a project and show roughly what I expect. The KNIME file is also uploaded and should give you good guidance. You can also watch the corresponding weeks’ lectures to see how this project was built up.
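
As a rough illustration of the classification checks in the list above, the sketch below computes test-set accuracy and a confusion matrix in plain Python. The labels and predictions are made-up placeholders, not results from the auto gas mileage data or the KNIME workflow, and `accuracy` and `confusion_matrix` are hypothetical helper names written only for this example.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of test rows whose predicted label matches the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_matrix(actual, predicted):
    """Return (labels, matrix), where matrix[i][j] counts the rows whose
    actual label is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up test-set labels and model predictions (illustration only).
y_test = ["high", "low", "low", "high", "low", "high", "low", "low"]
y_pred = ["high", "low", "high", "high", "low", "low", "low", "low"]

labels, matrix = confusion_matrix(y_test, y_pred)
print("accuracy:", accuracy(y_test, y_pred))
print("labels (row = actual, column = predicted):", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The same two helpers could be rerun for each candidate model or hyperparameter setting, always on the same testing portion, so that the comparison between them is fair.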

All of the items listed above are expected to appear in both your presentation and your report.
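
The regression workflow described in the list above (a 70%-30% split, a fit on the training portion, then adjusted R² on the training set and RMSE on the testing portion) might look roughly like the following in plain Python. This is a minimal one-variable sketch on synthetic data. The helper names and the generated numbers are assumptions made for illustration, and in the course project a tool such as KNIME would carry out these steps.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_simple_ols(rows):
    """Least-squares fit of y = intercept + slope * x for (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(rows, intercept, slope):
    """Root mean squared error of the fitted line on the given rows."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))


def adjusted_r2(rows, intercept, slope, n_predictors=1):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(rows)
    mean_y = sum(y for _, y in rows) / n
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for _, y in rows)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Synthetic data (illustration only): y is roughly 3 + 2x plus noise.
rng = random.Random(42)
data = [(x, 3 + 2 * x + rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)
intercept, slope = fit_simple_ols(train)
print("train adjusted R2:", adjusted_r2(train, intercept, slope))
print("test RMSE:", rmse(test, intercept, slope))
```

This sketch fits only one explanatory variable. To assess the improvement from adding more variables, the same two metrics would be recomputed after each addition with a multiple-regression fit, which is not shown here.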