Analyze data using R Programming


5/20/20 Assignment 2 Netherlands.docx P a g e | 1

Assignment 2

Tip: Read through this document in its entirety before you begin.

The assignment is to conduct research in R, based on the information below. After analyzing the data in R, document the research and findings in a research paper in APA 7 format. Ask questions if needed.

Topic: Stack Overflow hosts an annual survey for developers. The 2019 survey includes almost 90,000 respondents (Stack Overflow, n.d.a).

Problem: Surveys usually contain instructions that direct participants to answer to the best of their ability. Implicitly, this expectation of honest answers is also an expectation of consistent responses. Inconsistency can arise in a variety of ways. One example is that one person may interpret a question differently from the next. Another example is a multiple-choice question for which more than one of the choices, or none of them, is appropriate to the respondent. In the study by Stack Overflow (n.d.b), respondents answered the employment question and employment-related questions inconsistently. Modeling the survey results can provide new insight into these inconsistencies.

Question: Using a neural network and a random forest model with the Stack Overflow (n.d.b) data, do the survey responses on employment, developer status, and coding as a hobbyist, along with the answers to an open-source sharing question, provide sufficient information to predict how a participant responded to the question about their student status?

Data:

• The data and data dictionaries are online.

o Note: Your program must read the raw data in its original form. Do not modify the data outside of your R code. Use the data dictionary to understand the data.

o You can read Stack Overflow’s (n.d.a) report on the survey.

▪ Stack Overflow. (n.d.a). Developer survey results: 2019. Retrieved May 24, 2020, from https://insights.stackoverflow.com/survey/2019

o The data and data dictionary are downloaded together. When you visit this site, ensure you select the 2019 survey:

▪ Stack Overflow. (n.d.b). Stack Overflow annual developer survey [dataset and code book]. Retrieved May 24, 2020, from https://insights.stackoverflow.com/survey/
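One way to honor the note about leaving the raw data untouched is to read the downloaded CSV as it is and do every transformation inside the script. The following is a minimal sketch, not a required approach. The helper name `read_survey()` is hypothetical, and the file name in the usage comment is an assumption; check it against your own download.

```r
# Hypothetical helper: read the raw survey file without altering it on disk.
read_survey <- function(path) {
  read.csv(
    path,
    stringsAsFactors = FALSE,   # keep text columns as character for later cleaning
    na.strings = c("NA", ""),   # treat literal "NA" and blank fields as missing
    check.names = FALSE         # keep column names exactly as they appear in the file
  )
}

# Usage (the file name is an assumption about the download):
# survey <- read_survey("survey_results_public.csv")
# str(survey[, 1:8])
```

Keeping the columns as character at this stage leaves room to inspect and filter the answers before converting them to factors for modeling.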

Requirements for this data analysis project:

• Formulate at least one additional, well-developed research question.

• When conducting data analysis, limit your research to respondents from the Netherlands.

• Develop two classification algorithms: a neural network and a random forest classifier. Attempt to create a classification model with an accuracy that exceeds both 0.8 and the no-information rate when predicting the testing dataset. Tune the model(s) if they do not meet this threshold. Compare the two models’ accuracy.

• Do not forget to address the problem. **

• Explore the insights you can gain from this model and provide your interpretations when documenting your research.
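The data-handling side of these requirements can be sketched in base R as follows. This is an illustration under stated assumptions, not the expected solution. The helper names are hypothetical. The column names `Country`, `Employment`, `MainBranch`, `Hobbyist`, and `Student` are assumptions to verify against the data dictionary. The open-source column is left as an argument because choosing it is part of the assignment. Dropping rows with missing answers and using an 80/20 split are also assumptions.

```r
# Hypothetical helper: keep Netherlands respondents and the five modeling columns.
# Column names are assumptions; confirm them in the data dictionary.
prepare_nl <- function(survey, open_source_col) {
  cols <- c("Employment", "MainBranch", "Hobbyist", open_source_col, "Student")
  nl <- survey[which(survey$Country == "Netherlands"), cols, drop = FALSE]
  nl[complete.cases(nl), , drop = FALSE]   # assumption: drop rows with missing answers
}

# Hypothetical helper: random train/test partition (80/20 is an assumption).
split_train_test <- function(data, train_fraction = 0.8, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(data), size = floor(train_fraction * nrow(data)))
  list(train = data[idx, , drop = FALSE], test = data[-idx, , drop = FALSE])
}

# No-information rate: share of the most frequent class among the true labels.
no_information_rate <- function(actual) {
  max(prop.table(table(actual)))
}

# Overall accuracy: share of predictions that match the true labels.
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

# Target from the requirements: accuracy above 0.8 and above the no-information rate.
meets_target <- function(acc, nir, target = 0.8) {
  acc > target && acc > nir
}
```

The models themselves could then be fitted on `train` with, for example, `nnet::nnet()` and `randomForest::randomForest()`, and their predictions on `test` passed to `accuracy()`. Those packages are a suggestion, not something this document prescribes.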


Required files to submit:

1) Research paper in APA 7 format; MS Word document file type

2) R Script; final version

Bonus challenge:

Beyond the accuracy metric, explore the influence of the high no-information rate in this analysis. The idea is for you to discover how accuracy can be misleading: a high overall accuracy score may cover up poor accuracy on individual labels when the labels are unevenly distributed.

This challenge is specific to this data. Do not provide generic descriptions of the metrics; I am not interested in generic answers.
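To make the idea concrete, per-label sensitivity can be computed next to overall accuracy. The sketch below uses base R and a hypothetical helper, `per_class_sensitivity()`. The labels are made up for illustration only and are not survey results; the three class names are an assumption about how the student-status answers are coded.

```r
# Per-class sensitivity (recall): correct predictions for a class divided by
# the number of observations that truly belong to that class.
per_class_sensitivity <- function(actual, predicted) {
  lv <- sort(unique(c(as.character(actual), as.character(predicted))))
  cm <- table(
    predicted = factor(predicted, levels = lv),
    actual    = factor(actual, levels = lv)
  )
  diag(cm) / colSums(cm)
}

# Made-up labels for illustration only (NOT survey results).
actual    <- c(rep("No", 8), "Yes, full-time", "Yes, part-time")
predicted <- rep("No", 10)   # a model that always predicts the majority class

mean(actual == predicted)                  # overall accuracy equals the majority share
per_class_sensitivity(actual, predicted)   # 1 for "No", 0 for both minority classes
```

In this toy case the accuracy only matches the no-information rate, and both minority labels are never predicted correctly. Repeating the comparison with your own models' predictions on the Netherlands test set is what makes the discussion specific to this data.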

Tips:

• MainBranch is the variable name for developer status.

• There is a difference between OpenSourcer and OpenSource; make sure you understand which variable applies.

• There will be four predictor variables and one outcome variable with three classes.

• Make sure that you look at the frequency of potential responses. For example, in the summary of Employment at the end of this document, the answer Retired has only six observations associated with it. What would occur if all six were in the test set? Using a frequency threshold of 20, omit responses that fall below the threshold from the models’ data, if necessary.

o *If such low-frequency responses exist, it may be easier to omit them while the data type or class is character.
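The frequency check in the tip above can be sketched in base R as follows. The helper `drop_rare_responses()` is hypothetical, and keeping answers with a count of at least 20 is one reading of the threshold (an assumption). The demonstration data is rebuilt from the Employment counts listed in this document rather than read from the survey file.

```r
# Hypothetical helper: drop rows whose answer in `column` occurs fewer than
# `min_count` times. Intended for a character column, before factor conversion.
# Rows with a missing answer in `column` are dropped as well.
drop_rare_responses <- function(data, column, min_count = 20) {
  counts <- table(data[[column]])
  keep <- names(counts)[counts >= min_count]   # assumption: "at least 20" is kept
  data[data[[column]] %in% keep, , drop = FALSE]
}

# Demonstration built from the Employment counts listed in this document.
employment_counts <- c(
  "Employed full-time" = 1764,
  "Employed part-time" = 108,
  "Independent contractor, freelancer, or self-employed" = 218,
  "Not employed, and not looking for work" = 130,
  "Not employed, but looking for work" = 96,
  "Retired" = 6
)
demo <- data.frame(
  Employment = rep(names(employment_counts), times = employment_counts),
  stringsAsFactors = FALSE
)

filtered <- drop_rare_responses(demo, "Employment")
table(filtered$Employment)   # "Retired" no longer appears
```

Filtering while the column is still character avoids leftover empty factor levels. If the column has already been converted to a factor, `droplevels()` can remove the unused levels afterwards.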

Good to know:

• When submitting in Blackboard, you may receive an error because the R file type is not recognized. That is okay. It only indicates that SafeAssign cannot evaluate that part of your submission.

o The research paper will be written in a professional writing style, following APA 7 student paper format; you can use the student paper template.

o The document shall be 3-5 pages or at least 800 words. The page count does not include the cover page or reference page.

o Ensure that every reference in your reference list is also cited in the text. Do not forget to cite and reference the source of the data.

• When developing your research paper, you may modify the topic and problem statement. However, the minimum requirements for the method of analysis cannot be altered.

• Ensure that you make the research yours and complete this assignment independently.

• There are several different versions of this assignment. If you complete a version of this assignment that is not available to you in Blackboard, you will violate your pledge.

Employment
Employed full-time                                  : 1764
Employed part-time                                  :  108
Independent contractor, freelancer, or self-employed:  218
Not employed, and not looking for work              :  130
Not employed, but looking for work                  :   96
Retired                                             :    6