assignment work (AVD)
1/27/20 Assignment 2 Finding the right data.docx P a g e | 1
Research Assignment 2: Finding the right data
This document is provided to help you work through this robust research question. It offers insight into how to
address the fields in the question by working through three of them, along with some tips for working with the
data in R. Work through the fields one piece at a time, shown here in bold.
This research question requires several filters to obtain the secondary data sample that is necessary for this
research. Consider the field that represents whether a survey respondent has used the SO job board or is
aware of the board, but has never used it. Once you have identified the variable that represents this field, what
unique values can be found in the field?
The question has limited the scope of the data sample to respondents that use it and those that are aware of it,
but have never used it. We don’t have enough information! What is the survey question associated with these
survey answers?
Can you determine what unique answers or values you need to keep now?
Respondents that use it: how would they answer this survey question? Yes
Respondents that don’t use it, but know about the job board? No, I knew…
When you have lengthy character fields, it can get cumbersome to isolate the strings. If even one letter is out
of place, your filter will not perform as expected. Using the entire data frame of 88,883 observations the first
chunk returns zero observations; the second chunk returns 75,532 observations.
While the second option works, you may find that method slow to type and prone to typing
errors. How about this approach?
The function str_detect(), or string detect, searches a character string for a pattern you supply and returns TRUE wherever the pattern is found.
To keep it simple, don’t use the first word or last word of the string. Want to know why or more? Email me or
use help in RStudio.
What are the most influential features when predicting whether a survey respondent has used the SO job board
or is aware of the board but has never used it when considering respondents who reported residing in the
country …(your countries listed here); reported their age as somewhere between 18 and 65 years old; and that
indicated that they were either not at all, somewhat, or very confident in their manager; reported an
undergraduate major in either an engineering field, information systems, or web design, or statistics; in
addition to the responses these respondents reported regarding employment; how often the respondent
contributes to open source; and whether or not they code for a hobby; when the respondent indicated that
the number of years they have been coding is somewhere within one to 49 years using the data from SO
(2019)?
> unique(df$SOJobs)
[1] "No, I didn't know that Stack Overflow had a job board"
[2] "No, I knew that Stack Overflow had a job board but have never used or visited it"
[3] "Yes"
[4] NA
Have you ever used or visited Stack Overflow Jobs?
df <- filter(df, # ‘and’ statement
SOJobs == "Yes",
SOJobs == "No, I knew that Stack Overflow had a job board but have never used or visited it")
df <- filter(df, # ‘or’ statement
SOJobs == "Yes"|
SOJobs == "No, I knew that Stack Overflow had a job board but have never used or visited it")
df <- filter(df, str_detect(SOJobs, "knew")|SOJobs == "Yes")
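Before running these filters on the full survey, it can help to compare them on a handful of rows. The sketch below is only an illustration: the data frame toy and its six rows are made up for this example, and it assumes the dplyr and stringr packages are installed.

```r
library(dplyr)
library(stringr)

# toy is a made-up stand-in for the survey data, not the real file
knew <- "No, I knew that Stack Overflow had a job board but have never used or visited it"
didnt <- "No, I didn't know that Stack Overflow had a job board"
toy <- data.frame(SOJobs = c("Yes", knew, didnt, NA, "Yes", knew),
                  stringsAsFactors = FALSE)

# 'and': a single value cannot equal both strings, so nothing is kept
n_and <- nrow(filter(toy, SOJobs == "Yes", SOJobs == knew))

# 'or': keeps rows that match either string
n_or <- nrow(filter(toy, SOJobs == "Yes" | SOJobs == knew))

# str_detect(): keeps the same rows as the 'or' filter, with far less typing
n_detect <- nrow(filter(toy, str_detect(SOJobs, "knew") | SOJobs == "Yes"))

print(c(and = n_and, or = n_or, detect = n_detect))
```

On these six rows the 'and' filter keeps 0 rows, while the 'or' filter and the str_detect() filter both keep 4. The NA row is dropped by all three.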
Let’s look at another field from the research question: reported an undergraduate major in either an
engineering field, information systems, or web design, or statistics.
What unique values exist for this field?
Do you need the survey question? Maybe not. You do need to capture the questions in the sample section.
Here I’ve highlighted the words coinciding with the research question and truncated the list.
What’s the best method to approach this? Think about how you could use the string detect function here.
When you pick the words to detect, consider all the unique values. How can you validate the filter worked like
you had intended? Try using the function table() before and after your filter. The function table() will
alphabetize your unique field names, so they won’t appear in the same order as they did with unique(). Check
the remaining values and how often each one occurs in the data.
> unique(df$UndergradMajor)
[1] <NA>
[2] Web development or web design
[3] Computer science, computer engineering, or software engineering
[4] Mathematics or statistics
[5] Another engineering discipline (ex. civil, electrical, mechanical)
[6] Information systems, information technology, or system administration
[7] A business discipline (ex. accounting, finance, marketing)
[8] A natural science (ex. biology, chemistry, physics)
[9] A social science (ex. anthropology, psychology, political science)
[10] A humanities discipline (ex. literature, history, philosophy)
[11] Fine arts or performing arts (ex. graphic design, music, studio art)
[12] A health science (ex. nursing, pharmacy, radiology)
[13] I never declared a major
> unique(df$UndergradMajor)
[1] <NA>
[2] Web development or web design
[3] Computer science, computer engineering, or software engineering
[4] Mathematics or statistics
[5] Another engineering discipline (ex. civil, electrical, mechanical)
[6] Information systems, information technology, or system administration
[7] A business discipline (ex. accounting, finance, marketing)
df <- df %>%
  filter(str_detect(UndergradMajor, "engineering") |
         str_detect(UndergradMajor, "information") |
         str_detect(UndergradMajor, "statistics") |
         str_detect(UndergradMajor, "web")) %>%
  droplevels() # droplevels() is necessary when dropping factor levels
> table(df$UndergradMajor)
A business discipline (ex. accounting, finance, marketing)               1841
A health science (ex. nursing, pharmacy, radiology)                       323
A humanities discipline (ex. literature, history, philosophy)            1571
A natural science (ex. biology, chemistry, physics)                      3232
A social science (ex. anthropology, psychology, political science)       1352
Another engineering discipline (ex. civil, electrical, mechanical)       6222
Computer science, computer engineering, or software engineering
> table(df$UndergradMajor)
Another engineering discipline (ex. civil, electrical, mechanical)       6222
Computer science, computer engineering, or software engineering         47214
Information systems, information technology, or system administration    5253
Mathematics or statistics                                                2975
Web development or web design                                            3422
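The before-and-after check with table() can be tried on a few rows first. This is a minimal sketch under stated assumptions: the data frame toy is invented for the example, it uses only three of the survey's major labels, and it assumes dplyr and stringr are installed.

```r
library(dplyr)
library(stringr)

# toy is invented for illustration; the labels are a subset of the survey values
toy <- data.frame(UndergradMajor = factor(c(
  "Mathematics or statistics",
  "Web development or web design",
  "A business discipline (ex. accounting, finance, marketing)",
  "Mathematics or statistics",
  NA)))

# Counts before filtering
before <- table(toy$UndergradMajor)

kept <- toy %>%
  filter(str_detect(UndergradMajor, "statistics") |
           str_detect(UndergradMajor, "web"))

# Without droplevels() the filtered-out major is still listed, with a count of 0
after_no_drop <- table(kept$UndergradMajor)

# With droplevels() only the majors that remain are listed
after_drop <- table(droplevels(kept)$UndergradMajor)

print(before)
print(after_no_drop)
print(after_drop)
```

Comparing the three tables shows both things to look for: the counts of the kept majors should not change, and the unwanted majors should disappear entirely rather than linger with a count of 0.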
When you think it’s a field with numeric values, but it isn’t. From the research question:
when the respondent indicated that the number of years they have been coding is somewhere within
one to 49 years
Pay attention to what these fields contain. If the data type is a character or factor field, you cannot use numeric
values to filter. Look at what happens here when working with the entire data set.
Technically they filter for the same information. Is either one correct? Nope. What unique values exist?
Do you see the quotation marks? R is interpreting every value here as strings or words, not numbers.
Which function call is correct?
The first returns 87,938. The second returns 86,445. What’s happening here? Because these filters are using
!= or does not equal, you have to consider how that impacts whether you use a comma or the vertical pipe |.
For this example, you can see the difference when you use unique(). Wrapping it in length() shows the number of unique values.
An odd set of outcomes, right? How did it go from 53 to 52 and from 53 to 50? There were two labels filtered
out in both function calls, yet neither removed two unique values.
Both the ‘or’ and ‘and’ statements removed NA. The ‘or’ statement removed neither string, because every
non-missing value differs from at least one of the two labels. The ‘and’ statement is needed here. The field
still has to be converted to a numeric type, then filtered for the range in the research question.
> nrow(filter(df, YearsCode >= 1, YearsCode <= 49))
[1] 59127
> nrow(filter(df, YearsCode != "More than 50 years", YearsCode != "Less than 1 year"))
[1] 86445
> unique(df$YearsCode)
 [1] "4"                  NA                   "3"                  "16"
 [5] "13"                 "6"                  "8"                  "12"
 [9] "2"                  "5"                  "17"                 "10"
[13] "14"                 "35"                 "7"                  "Less than 1 year"
[17] "30"                 "9"                  "26"                 "40"
[21] "19"                 "15"                 "20"                 "28"
[25] "25"                 "1"                  "22"                 "11"
[29] "33"                 "50"                 "41"                 "18"
[33] "34"                 "24"                 "23"                 "42"
[37] "27"                 "21"                 "36"                 "32"
[41] "39"                 "38"                 "31"                 "37"
[45] "More than 50 years" "29"                 "44"                 "45"
[49] "48"                 "46"                 "43"                 "47"
[53] "49"
df <- filter(df, YearsCode != "Less than 1 year"| # using an ‘or’ statement
             YearsCode != "More than 50 years")
df <- filter(df, YearsCode != "Less than 1 year", # using an ‘and’ statement
             YearsCode != "More than 50 years")
> length(unique(df$YearsCode)) # unfiltered original data
[1] 53
> length(unique(df$YearsCode)) # filtered with the ‘or’ statement above
[1] 52
> length(unique(df$YearsCode)) # filtered with the ‘and’ statement above
[1] 50
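The behavior of != with a comma versus the vertical pipe is easier to see on a few values. The sketch below is an illustration only: toy is a made-up six-row sample stored as character, like the survey field, and it assumes dplyr is installed.

```r
library(dplyr)

# toy is a made-up sample of YearsCode values, stored as character like the survey
toy <- data.frame(YearsCode = c("4", "Less than 1 year", "10",
                                "More than 50 years", NA, "49"),
                  stringsAsFactors = FALSE)

# 'or': every non-missing value differs from at least one label,
# so the condition is TRUE for all of them and only NA is dropped
or_kept <- filter(toy, YearsCode != "Less than 1 year" |
                    YearsCode != "More than 50 years")

# 'and': a row must differ from both labels, so both labels and NA are dropped
and_kept <- filter(toy, YearsCode != "Less than 1 year",
                   YearsCode != "More than 50 years")

print(or_kept$YearsCode)
print(and_kept$YearsCode)
```

Here the 'or' version still contains both text labels (five rows remain), while the 'and' version keeps only "4", "10", and "49". This mirrors the pattern in the counts of unique values for the full data.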
What range of data is available for this field that represents the number of years the respondent has been programming, after the ‘and’ statement?
There is one more step you need to take, before you’re done with this field. You changed it, now validate those
last changes. How do you know it’s correct?
Beyond the three variables shown in this document so far, if you run into trouble trying to model this data in
your analysis, there are two things you can look for.
The first: is your data a regular data frame, not a tbl_df or tibble?
If not, convert it to a data frame: df <- as.data.frame(df).
The second: did you validate the changes you made? What does a summary of your data look like?
How many levels do your factors have? Is it clean? If the function summary returns factor levels with a count of
0, your data is not clean!
If you were filtering for Australia, Russia, and the Netherlands, along with only full-time and part-time workers,
and the use of summary() returned the following:
Your data is not clean! If you attempt to train a random forest model on unclean data, as shown, it could take
hours to process and may finish with an error. Clean the data first. Empty levels? Try this:
The final caveat: the confusion matrices will be a lot easier to read if you change the labels of the
outcome variable before training your model.
> range(df$YearsCode %>% as.numeric())
[1] 1 50
# after the ‘and’ statement, the field is 1:50
# the field needs to be permanently changed to numeric
# the field will need to be filtered again
> df$YearsCode <- as.numeric(df$YearsCode)
> df <- filter(df, YearsCode >= 1, YearsCode <= 49)
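A small base R sketch can show why the numeric filter misbehaves on a character field and how to validate the conversion afterwards. The vector years is made up for this example and stands in for YearsCode after the two text labels and NA have been removed.

```r
# years is a made-up character vector standing in for the filtered YearsCode field
years <- c("4", "10", "49", "50", "5", "1")

# Comparing a character value with a number compares them as text:
# the number 49 is converted to "49", and "5" sorts after "49"
text_compare <- "5" <= 49

# Convert first, then filter on real numbers
years_num <- as.numeric(years)
kept <- years_num[years_num >= 1 & years_num <= 49]

# Validate the change: the range should now fall inside 1 to 49
print(text_compare)
print(range(kept))
```

The text comparison returns FALSE even though 5 is less than 49, which is why the row count from the numeric-looking filter on the character field cannot be trusted. After converting, range() confirms the values run from 1 to 49.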
                                                Employment              Country
 Employed full-time                                  :2252   Australia         :845
 Employed part-time                                  : 108   Russian Federation:764
 Independent contractor, freelancer, or self-employed:   0   Netherlands       :751
 Not employed, and not looking for work              :   0   Afghanistan       :  0
 Not employed, but looking for work                  :   0   Albania           :  0
 Retired                                             :   0   Algeria           :  0
                                                             (Other)           :  0
> df <- droplevels(df) # no more empty factor levels!
# after dropping the empty factor levels, summary returns this
> summary(df)
             Employment                 Country
 Employed full-time:2252   Australia         :845
 Employed part-time: 108   Netherlands       :751
                           Russian Federation:764
> levels(df$SOJobs) # filtered SOJobs for analysis has two levels
[1] "No, I knew that Stack Overflow had a job board but have never used or visited it"
[2] "Yes"
> levels(df$SOJobs) <- c("No","Yes") # change labels; order of labels matters
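Because the order of the labels matters, it is worth checking the levels and the counts on either side of the change. The sketch below uses base R only; the factor outcome and its three values are made up to stand in for the filtered SOJobs field.

```r
# outcome is a made-up factor with the two labels left after filtering SOJobs
outcome <- factor(c(
  "Yes",
  "No, I knew that Stack Overflow had a job board but have never used or visited it",
  "Yes"))

# Check the order first: levels are sorted alphabetically, so "No, ..." comes before "Yes"
print(levels(outcome))

# Counts before relabeling, to compare against afterwards
counts_before <- as.vector(table(outcome))

# Assign the short labels in that same order
levels(outcome) <- c("No", "Yes")

# Validate: the counts per label should be unchanged
print(table(outcome))
```

If the new labels were supplied in the opposite order, every "Yes" would silently become "No" and the counts would swap, so comparing the table before and after is a quick safeguard.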