Data Analysis Project: A Proposal


Objective: Students will conduct a simple statistical study to answer a research question. This is a research data analysis project in which you will apply the methods and concepts learned in this class to real-life situations.

Data Analysis Project Proposal

  • Planning a study
  • Gathering data
  • Analyzing data; descriptive statistics

Managers often make decisions by studying the relationships between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict, and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product’s price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditure. Predictions of demand for various combinations of price and advertising expenditure can then be used to evaluate potential changes in the company’s marketing strategies.
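To make the marketing example concrete, here is a minimal sketch of fitting a prediction equation of the form demand = b0 + b1*price + b2*advertising by least squares. The numbers are invented purely for illustration (demand is constructed from a known equation so the fit can be checked), and the function name fit_linear_model is hypothetical. For the project itself, Excel's regression tool does the same job.

```python
# Illustrative only: these numbers are invented, not real market data.
prices = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
advertising = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]
# Demand is built from a known equation so the fitted coefficients can be checked.
demand = [100 - 5 * p + 2 * a for p, a in zip(prices, advertising)]


def fit_linear_model(rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... using the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for the intercept
    k = len(X[0])
    # Build the normal equations (X'X) b = X'y.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * k
    for i in range(k - 1, -1, -1):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


coef = fit_linear_model(list(zip(prices, advertising)), demand)


def predict_demand(price, adv):
    """Predicted demand for one price/advertising combination."""
    return coef[0] + coef[1] * price + coef[2] * adv


print("intercept, price slope, advertising slope:", [round(c, 4) for c in coef])
print("predicted demand at price 3.0, advertising 12.0:",
      round(predict_demand(3.0, 12.0), 2))
```

Because the made-up demand values follow the equation exactly, the fit recovers it. With real data the points scatter around the fitted equation, and the coefficients are only estimates.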

Explore public and government data sets and databases to see what types of data are available. Decide on a research question about the relationship between the variables. If you have personal data or data from work that you want to use for this project, you may do so as long as the data is not confidential. You will need to share your results with your classmates.

The data set can be on any topic and should be one that you find interesting.

Your data set should consist of at least three variables, one of which is the variable you want to predict from the others. You should avoid categorical response variables. It's OK if one of your predictor variables is categorical, but not all of them.

Please submit the following as an MS Word document containing all of the information. In your write-up, briefly answer each of the following questions.

  1. Describe your data set (including the source). Why are you interested in it? What do you hope to learn? Before exploring your data set, state some hypotheses (guesses) about how the variables should be related, perhaps based on your knowledge and experience, but please provide at least one reference to support your claim. Be sure to identify the response variable and the predictor variables.
  2. Make a scatterplot of your response variable (on the Y-axis) versus one of the predictor variables (on the X-axis). Describe the pattern you see. Is this pattern consistent with what you expected? Note any apparent outliers in the plot. Can you propose a "cause" for these outliers? Repeat the entire procedure for the other predictor variables.
  3. Can you think of any other variables (not in your data set) that might be useful in predicting Y? Try to list a few possibilities.
  4. For each variable, obtain descriptive statistics using the Data Analysis ToolPak in Excel and create a histogram using Excel.
  5. For each variable, based on the descriptive statistics output, decide whether the variable has a normal distribution. You can also check a scatterplot to see whether the relationship is linear. If you see a problem with your data, and if all of the data values for this variable are positive, try taking the log of the variable. Then create the descriptive statistics and histogram for the log of the variable, and decide whether the problem is reduced. Please note that if a variable has any zero or negative values, then taking logs is NOT appropriate, so there is no point in trying it in this case. See the Excel tutorial on how to graph log-transformed data.
  6. Rerun the scatterplot (and answer the rest of question 2) using the logged variables wherever taking logs was found appropriate in question 5. Here are some examples of what I mean. If you decided to take logs of predictor variable X2 only, then you should run a scatterplot of your response variable (let's call it Y) against log(X2). If you decided to take logs of X2 and X3, then you should run scatterplots of Y versus log(X2) and Y versus log(X3). If you decided to take logs of Y only, then you should run scatterplots of log(Y) versus all of the (non-logged) predictor variables. If you decided not to take the log of any of the variables, you do not need to do anything. For each scatterplot you create here, compare it with the corresponding one from question 2. Did taking logs help you to uncover a relationship between the variables?
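
The checks in questions 4 through 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented (Y doubles each time X goes up by one), the helper names describe, log_transform, and pearson_r are hypothetical, and skewness is computed with the adjusted sample formula. The project itself asks for Excel output, so treat this as a way to see the logic, not as a replacement for it.

```python
import math
import statistics

# Illustrative only: invented data in which Y doubles for each unit increase in X.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0 ** xi for xi in x]


def describe(values):
    """Basic descriptive statistics, including adjusted sample skewness."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    skew = n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)
    return {"n": n, "mean": mean, "median": statistics.median(values),
            "stdev": sd, "skewness": skew}


def log_transform(values):
    """Natural log of each value; refuses data with zero or negative values."""
    if any(v <= 0 for v in values):
        raise ValueError("logs are not appropriate: zero or negative value found")
    return [math.log(v) for v in values]


def pearson_r(a, b):
    """Pearson correlation coefficient, a rough measure of linear association."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    sxy = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    sxx = sum((ai - mean_a) ** 2 for ai in a)
    syy = sum((bi - mean_b) ** 2 for bi in b)
    return sxy / math.sqrt(sxx * syy)


log_y = log_transform(y)
stats_y = describe(y)
stats_log_y = describe(log_y)
r_raw = pearson_r(x, y)
r_log = pearson_r(x, log_y)

print("skewness of Y:     ", round(stats_y["skewness"], 3))
print("skewness of log(Y):", round(stats_log_y["skewness"], 3))
print("correlation of X with Y:     ", round(r_raw, 3))
print("correlation of X with log(Y):", round(r_log, 3))
```

In this made-up case Y is strongly right-skewed, log(Y) is symmetric, and the relationship between X and log(Y) is exactly linear. Real data will be less clear-cut, so the histogram and scatterplot should still guide the decision.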


Please read the assignment in full. It is due July 25 by 7:00 pm. 

