Statistics question

Biol/Stat 2244A FW21 – Assignment 2

BIOL/STATS 2244 Assignment 2:

Data

Objectives

This Assignment is designed to demonstrate your current mastery of the following Learning Outcomes:

i. Communicate statistical concepts, analyses, and arguments in an accurate and scholarly manner;
   a. Apply vocabulary to describe statistical concepts, procedures, and ideas;
   b. Apply conventional formats for reporting and interpreting results of statistical analyses in written/graphical form.

ii. Create and interpret appropriate summaries of data;
   a. Select summaries based on research question and variables.
   b. Interpret summaries to identify and/or describe patterns, trends, and interesting features in data

iii. Use statistical software to explore, summarize, analyse, interpret, and communicate data;
   a. Use R to create and modify graphical and numerical summaries of data

To achieve these objectives, students will need to draw on course material primarily from the topics Planning Ahead: Sampling variability, and Summarizing & Exploring Data in R.

How this Assignment ‘works’

This Assignment is the second of four Assignments in the course; it continues our progression through the phases of the PPDAC Framework. The major focus of this Assignment is the Data stage of PPDAC. In the first Assignment, you were introduced to some ‘Research Background’ related to academic integrity and personality characteristics; we will continue to work with this background (you can refer back to it if necessary for this assignment). You may recall that you designed a sampling and study design to address a very specific research question in Assignment 1 as part of the “Problem” and “Plan” phases of the PPDAC Framework. The next phase of the PPDAC framework, the “Data” phase, would involve collecting data and exploring data from our Plan; unfortunately, we don’t have the time/resources for you to implement the Plan you personally devised. So, instead, we will use an openly available dataset that deals with the research objective introduced in Assignment 1. To help you understand both the way the data were collected and what the various variables represent, this Assignment 2 file is accompanied by both the datafile (named assignment.csv) that you will be working with, as well as a detailed metadata file. You should, therefore, review the rest of this Assignment file so you understand what you are being asked to do, and then spend some time reviewing the metadata file, while exploring the datafile in R. Note carefully: you will not use every variable and/or every data point in the datafile for this Assignment! Some variables in the datafile will be relevant to this Assignment, some will be used for Assignments 3 and 4, some variables will be used repeatedly, and other variables may not be used at all!

As was the case for Assignment 1, you will find a “Research Prompt” section below that repeats the Research Objective that serves as the connecting ‘theme’ for the four Assignments in the course. It will also provide you with the Research Question that is the focus of this particular Assignment; note carefully that it may have changed from the Research Question given in Assignment 1. And finally, while we are ‘imagining’ that we have collected the data that you are working with (i.e. you can/should pretend that you are the researcher who obtained this data), always keep in mind that this is still an Assignment in Biol/Stat 2244 for which you are trying to obtain credit by demonstrating your mastery of course concepts. This Assignment will be graded based on the degree to which your submission meets the criteria listed in the “Grading for Assignment 2” section, which is provided after the Assignment Questions section. So, it is in your best interest to carefully review that grading section, so you understand how you should structure your answers for maximal success/credit.

Research Prompt and Question

You are a researcher with the Department of Psychology at Western University (London, ON Canada). You regularly collaborate with other researchers in the Faculty of Education. Your research interests and expertise are in the area of academic integrity and personality traits, with a special interest in personality and behaviour related to psychopathy. You have conducted previous research investigating personality traits and characteristics that may motivate individuals to engage in academic misconduct. Your current research program is centred around extending that research to understand which personality characteristics associated with psychopathy are specifically involved in the motivation to engage in academic dishonesty.

You have a research lab (with graduate students and research fellows who can help you with your research, e.g. as individuals helping to collect data if necessary) with funding that you are using to conduct research to further the following Research Objective:

Understand the personality traits and demographic characteristics (e.g. gender, age, etc.) that motivate and are associated with engaging in academic dishonesty.

Currently, you are planning a research study to specifically address the following Research Question:

Does greater tendency towards psychopathic personality traits influence the likelihood that students engage in academic misconduct?

Tips for success/Approaches to take for this Assignment

This is the first Assignment that involves using R and the R markdown file format. There are some things you should definitely do to reduce your stress and make sure that you are on the path to success.

Working with your R markdown file:

1. Start answering your assignment directly in an R markdown file, rather than drafting answers in a Word file or some other file format. Copying/pasting from Word (or Pages) into an R markdown file can introduce problems in code and text formatting that can be really cryptic to uncover and resolve.

2. Use the regular text sections of R markdown to provide written answers and use the R chunks to write your code, applying only brief #comments within the R chunks to help us understand your code. Seriously, anything that involves writing a sentence should be done in the ‘text’ section of the R markdown file. Don’t try to force sentences in the R Chunk #comments.

3. Try to import your data into your R markdown file right away (i.e. have your read.csv() function IN the R markdown file). Better to get that sorted out right from the start.

4. As you write code in your R markdown file, constantly try to knit (i.e. do not just ‘run’ an R chunk) your file to PDF. Waiting until you have finished your assignment and trying to knit at the last minute can be stressful, especially if there are many ‘errors’ that stop it from working. Deal with errors as soon as they arise.

Suggestions for getting started with the datafile:

1. It’s a good idea to ‘inspect’ your datafile right from the start. Make sure it imported properly (i.e. use the quality check in the metadata file to confirm). You do not need to include the results of the quality check in your final Assignment submission.

2. Investigate how R ‘interpreted’ the data and variables during the import process. One handy function is str(dataframe), where the argument is the name of your dataframe (i.e. what you saved the file as when you imported it; I’d suggest something short like a2). This function is handy because it gives you a brief snapshot of all the variables, and tells you what type of variable R has ‘coded’ for each of your variables (i.e. character, numeric, integer, factor). It might be handy to think carefully about what type each variable really is (i.e. based on the descriptions in the metadata file) and see whether R has interpreted the variables accurately. You do not need to keep the results of this function in your final Assignment submission; it’s just an initial check for you.
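
The import-and-inspect steps above can be sketched roughly as follows. This is only an illustration: so that it runs on its own, it writes a tiny made-up CSV to a temporary file, and the column names (age, gender, boldness) are hypothetical stand-ins rather than the actual variables in assignment.csv. In your own file you would point read.csv() at assignment.csv instead.

```r
# Write a tiny, made-up CSV so this sketch runs on its own.
# The column names are hypothetical; check the metadata file for the real ones.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,gender,boldness",
             "19,female,25",
             "22,male,18",
             "20,female,30"), tmp)

# Import and save to a short, lowercase dataframe name in one step
a2 <- read.csv(file = tmp, header = TRUE)

# Quick checks: number of rows/columns, then how R coded each variable
dim(a2)
str(a2)

# If a categorical variable was imported as character, it can be converted
a2$gender <- factor(a2$gender)
levels(a2$gender)
```

Comparing the types reported by str() against the variable descriptions in the metadata file is the point of the check; whether any conversion (such as the factor() call here) is needed depends on how your own import turned out.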

General tips for working with R

1. If you get stuck in R, ask questions right away. You can post to the Forums under the “Assignment 2” section; no one will see your post except me because it is a moderated Forum section. Note that even if I “deny” your posting (so that other students cannot see it), I will still leave a reply. You can also get help through the R Help Sessions, my Student Hours, and/or at various Assignment 2/Help Sessions with TAs that occur in person/online during the 3-h lab period.

2. It’s a good idea to plan out/think about what you want to do (for data transformations, building a graph, etc.) before trying to achieve it in R. For example, pretend you are completing the data transformation and graphing by hand or on paper. Figure out what you would do/create so you have the steps or what you want, then build it in R.

3. It’s a good idea to start ‘simple’ with R; that is, see if you can get something simple to work, and then progressively add more complexity/customization to it if necessary.

4. Keep the names of your files, variables, dataframes simple and short. Some tips/things to keep in mind:

• R is case sensitive; that is, Data and data are recognized as two different things by R. I’d suggest sticking to all lowercase (i.e. common) letters to make things simple, but that’s a personal preference.

• Symbols like *, /, !, etc. have special meanings in R. Don’t use them in your names.

• R doesn’t like spaces in names; if you really want to show a space, use an underscore, e.g. first_data rather than first data

• R wants your names to start with a letter, not a number. So, use data1 rather than 1data; similarly, R does NOT like symbols in names, so don’t make new variables, filenames, etc. like data*1

• There are certain words that R already uses as the names of built-in functions or arguments (or, in the case of function, reserves outright) that you should avoid for your names; these include: c, data, file, function, subset. So, if you want to call something ‘data’ for example, then call it ‘mydata’ instead.
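
A short sketch of the naming tips above, using only base R. The object names and the helper is_valid_name() are made up for illustration; make.names() is a base R function that shows how R itself would repair a name it does not consider valid.

```r
# R is case sensitive: these are two different objects
mydata <- c(1, 2, 3)
Mydata <- c(10, 20, 30)
identical(mydata, Mydata)   # FALSE

# make.names() shows how base R would repair names that are not valid
bad_names <- c("first data", "1data", "data*1")
make.names(bad_names)

# Hypothetical helper: does a name survive make.names() unchanged?
is_valid_name <- function(x) x == make.names(x)
is_valid_name(c("first_data", "data1", "first data", "1data"))
```

Names that pass this check can still clash with existing functions such as data or c, so the check complements the advice above rather than replacing it.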

Assignment Questions

Note carefully! Before you start writing your answers to the Assignment questions, you should review the sections on Grading of Assignment 2 and Format of Assignment 2 (at the end of this file) so that you can write your answers in a way that will demonstrate your mastery of the course material, while addressing the criteria for a good answer. In particular, one of the objectives of this Assignment is for you to apply conventional formats for reporting and interpreting results of statistical analyses in written/graphical form. You had an entire topic within our lectures assigned as self-study related to proper graph format and information which you will want to review as you prep your answers (in particular for Question 2). Another objective of this Assignment is for you to apply vocabulary to describe statistical concepts, procedures, and ideas. Question 1 has some conceptual parts to it (in addition to some data exploring using R); likewise, Question 2c may provide opportunity to demonstrate use of vocabulary. You should endeavor to integrate relevant statistical vocabulary into your answer. Show me you understand the concepts and can use the language correctly!

Question 1. In most research involving humans, researchers describe the ‘demographics’ of their sample (i.e. age, gender, and other characteristics, etc.). This description often occurs because the researchers want to highlight the degree to which their sample represents (or possibly deviates from) the population of interest, and/or, to identify the similarity/differences between their sample and samples used in related research. Question 1, therefore, gets you to describe your sample and consider the implications of what you discover about the sample.

a) Use R to compute numerical summaries (i.e. not graphs) that describe the demographic characteristics of your sample; be sure to summarize at least one quantitative variable and at least one categorical variable. Write a short paragraph that describes the sample based on your summaries.

b) Based on the Research Question, the population of interest is students. Using what you learned about the sample from part a and your understanding of the sampling and/or study design used to collect the data, discuss whether you think your sample is a good representation of the population of interest. Be sure to apply your understanding of sampling strategies/designs and concerns, and sampling variability/error in your discussion.

Note: your written answers to parts a and b don’t need to be extensive. A few sentences for each part (e.g. 150 words or less for each of parts a and b) should be sufficient.
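
As a rough illustration of the kind of numerical summaries part a refers to, the sketch below summarizes one quantitative and one categorical variable in base R. The data frame and its variable names (age, gender) are made up for this example; which variables and which summaries are appropriate for the real dataset is a judgment to make from the metadata file.

```r
# Made-up demographic data; the variable names are hypothetical
demo <- data.frame(
  age    = c(18, 19, 19, 20, 21, 22, 25, 31),
  gender = c("female", "female", "male", "female",
             "male", "female", "male", "female")
)

# Quantitative variable: sample size, centre, and spread
age_summary <- c(
  n      = sum(!is.na(demo$age)),
  mean   = mean(demo$age, na.rm = TRUE),
  sd     = sd(demo$age, na.rm = TRUE),
  median = median(demo$age, na.rm = TRUE),
  iqr    = IQR(demo$age, na.rm = TRUE)
)
round(age_summary, 2)

# Categorical variable: counts and proportions
gender_counts <- table(demo$gender)
gender_counts
round(prop.table(gender_counts), 2)
```

Reporting both the mean/standard deviation and the median/IQR is done here only to show the functions; for a skewed variable, the median and IQR are usually the more informative pair.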

Question 2. During the Data stage of PPDAC, it is a good idea to explore your data using a graphical summary. This Assignment question asks you to create a graph to ‘answer’ the Research Question. However, you will first create a new variable using R to be used in your graph.

a) Create a new variable using R that is called psych_rank. This new variable will characterize each individual in the sample as high, moderate, low, or very low with respect to the personality traits associated with psychopathy. Individuals who score high on all three (3) of the psychopathy personality traits (boldness, meanness, and disinhibition) would be ranked as ‘high’ on psychopathy scale. Individuals who score high on two (2) of the three psychopathy personality traits would be ranked as ‘moderate’ while individuals who score high on one (1) of the three psychopathy personality traits would be ranked as ‘low’ on the psychopathy scale. Individuals who score low on all three (3) psychopathy personality traits would be ranked as ‘very low’. For each personality trait, scoring high is defined as:

• meanness score greater than 15

• boldness score greater than 23

• disinhibition score greater than 24
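
One possible way to express the ranking rule above in base R is sketched below. The data are made up, and the column names (meanness, boldness, disinhibition) are assumed to match the trait names; the actual column names should be taken from the metadata file. The sketch counts how many of the three thresholds each individual exceeds and then labels that count.

```r
# Made-up scores; real column names should be taken from the metadata file
a2 <- data.frame(
  meanness      = c(20, 10, 16, 15, 5),
  boldness      = c(30, 24, 20, 23, 10),
  disinhibition = c(25, 30, 10, 24, 3)
)

# Count how many of the three traits each person scores 'high' on
# (TRUE counts as 1 and FALSE as 0 when logicals are added)
n_high <- (a2$meanness > 15) + (a2$boldness > 23) + (a2$disinhibition > 24)

# Map 0, 1, 2, 3 high traits onto ordered labels
a2$psych_rank <- factor(
  n_high,
  levels = 0:3,
  labels = c("very low", "low", "moderate", "high")
)

table(a2$psych_rank)
```

The fourth made-up individual sits exactly on all three cut-offs and is ranked ‘very low’, because the rule says "greater than". Any individual with a missing trait score would get NA for psych_rank under this approach, which may or may not be what you want.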

b) Create a SINGLE graph using R (i.e. no multi-pane figures or faceting) that addresses the Research Question, using the psych_rank variable as your measurement of tendency towards psychopathic personality traits. Your graph should demonstrate the characteristics of good figures for text format.

c) Draw a conclusion about the Research Question using your graph (no formal statistical analysis/inference should be used for this question; you are just interpreting what you see in your graph). State your conclusion based on the graph; provide support for your conclusion by referencing relevant elements/information obtained from the graph you created.

Note: your written answer to part c doesn’t need to be extensive. A few sentences (e.g. 150 words or less) should generally be sufficient.
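
As a hedged illustration of a single graph relating psych_rank to an outcome, the sketch below draws a base R bar plot of the proportion reporting misconduct within each rank. The data are made up, and misconduct is a hypothetical yes/no variable; the real outcome variable, and whether a bar plot is the most appropriate graph for it, should be decided from the metadata file.

```r
# Made-up data; 'misconduct' is a hypothetical yes/no outcome variable
a2 <- data.frame(
  psych_rank = factor(
    c("very low", "very low", "very low", "low", "low", "low",
      "moderate", "moderate", "high", "high"),
    levels = c("very low", "low", "moderate", "high")
  ),
  misconduct = c("no", "no", "yes", "no", "yes", "no",
                 "yes", "no", "yes", "yes")
)

# Proportion reporting misconduct within each psych_rank group
prop_yes <- tapply(a2$misconduct == "yes", a2$psych_rank, mean)

# Single bar plot with labelled axes and a fixed 0-1 scale
barplot(
  prop_yes,
  ylim = c(0, 1),
  xlab = "Psychopathy rank",
  ylab = "Proportion reporting academic misconduct",
  col  = "grey70"
)
```

Setting the factor levels explicitly keeps the bars in rank order rather than alphabetical order. Proportions are used instead of raw counts so that groups of different sizes can be compared directly.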

Grading for Assignment 2

Your answers to the Assignment Questions will be graded based on the 4-level rubric given on the next pages, which focuses on your ability to demonstrate mastery of the course-level learning outcomes listed at the start of this instructions file (plus some general completeness and formatting criteria).

Additional notes on grading

• Failure to submit the Assignment at all will result in a ‘0’ for the Assignment

• Completion of at least three (3) Assignments and earning at least (50% and/or) all rubric level 2s on at least 3 of the 4 Assignments is part of the ‘Essential Requirements’ to be eligible to earn credit (i.e. 50% or higher as a final course grade) for the course. Failing to meet the Essential Requirements with respect to Assignments will result in a final course grade recorded as 45% (or, your calculated course grade—whichever is lower).

• Late Assignments (i.e. beyond the 12-hour grace period on the official deadline, and without academic accommodation resulting from the submission of a Self-Reported Absence, or, as approved by an Academic Counselor) will be accepted with a late penalty equivalent to ONE (1) rubric level per 24 hours or part thereof. That means that your late Assignment will not be accepted beyond three (3) days/72 hours without accommodation. To submit a late Assignment, you will need to contact me through OWL Messages, including your Assignment file(s) as attachments; the time of receipt of the OWL Message in my inbox will be used for determining the total late penalty.

Learning outcome/Grading characteristics

• Level 4: ALL of the following must be achieved
• Level 3: Up to TWO of the following occur
• Level 2: Up to ONE of the following occur, OR, MORE THAN TWO of the Level 3 characteristics occur
• Level 1: ANY of the following occur, AND/OR, MORE THAN ONE of the Level 2 characteristics occur

Apply vocabulary to describe statistical concepts, procedures, and ideas

• Level 4: Answer applies statistical vocabulary (taught in 2244) where the vocabulary makes sense, and use is correct in all instances. Sampling and/or study designs are accurately identified where mentioned.
• Level 3: Clear attempt to apply statistical vocabulary where appropriate, except: minor inaccuracies in vocabulary occur, or, minor absence of relevant vocabulary for concepts/processes that are obviously being described/invoked.
• Level 2: Vocabulary use and/or answer suggests misunderstanding or confusion between/about major concepts (e.g. type of sampling/study design; sampling design vs. study design; sampling frame, population, vs. sample, etc.), or, multiple significant inaccuracies in application of vocabulary.
• Level 1: Little to no attempt to incorporate statistical vocabulary, and/or, majority of attempts contain major errors that suggest a misunderstanding of concepts, and/or, vocabulary use is predominantly through definitions rather than integration into the answer.

Question 2 only: Apply conventional formats for reporting and interpreting results of statistical analyses in written/graphical form

• Level 4: Format demonstrates a clear understanding and application of the characteristics of good figures for in-text format.
• Level 3: Answer demonstrates an understanding and application of the characteristics of good figures for in-text format except: minor error in format or minor absence of relevant information
• Level 2: Answer demonstrates an incomplete understanding/application of the characteristics of good figures for in-text format.
• Level 1: Little to no attempt to apply the characteristics of good figures for in-text format.

Completeness and format of answers

• Level 4: Addresses adequately all components of the Assignment Questions. Assignment format guidelines followed.
• Level 3: Addresses Assignment Question, but misses a minor component of a Question. Evident that Assignment format guidelines were followed but minor errors occur that don’t significantly impact interpretation of answers.
• Level 2: Addresses the Assignment Question, but misses/skips a part of the question (e.g. part a, part b). Attempt to follow Assignment format guidelines is evident, but some formatting points are missed that impact grading of answers.
• Level 1: Does not address the Assignment Question.

Understanding of data and variables as described in metadata file (descriptors in column order, highest level first)

• Variables selected to answer an Assignment question demonstrate an understanding of the dataset as described in the metadata file, and their role in the study design.
• One or more variables selected to answer an Assignment question demonstrate(s) a misunderstanding of the dataset and/or the variable(s) role in the study design.

Select summaries based on research question and variables (descriptors in column order, highest level first)

• Selected summaries are appropriate for all instances of use, based on an accurate understanding of the Assignment and/or Research Question being addressed, and the type of variables being used. Where more than one summary would potentially be appropriate, choice demonstrates a critical understanding of what is most appropriate given the data.
• Selected summaries are appropriate for all instances of use, based on an accurate understanding of the Assignment and/or Research Question being addressed, and the type of variables being used.
• One or more selected summaries is inappropriate given an accurate understanding of the Assignment and/or Research Questions being addressed, and the type of variables being used.

Interpret summaries to identify and/or describe patterns, trends, and interesting features in data (descriptors in column order, highest level first)

• Generated summaries are represented and interpreted accurately in all instances.
• Generated summaries are represented and interpreted accurately but with minor error; understanding of nature of summary is evident.
• Representation and/or interpretation of summary suggests significant misunderstanding of information represented.

Use R to create and modify graphical and numerical summaries of data (descriptors in column order, highest level first)

• All data transformation, summaries, and analysis are conducted in R. All R code is visible in R markdown file and resulting knitted file. All code is successful in producing relevant product.
• All data transformation, summaries, and analysis are conducted in R, but some code is not successful in producing relevant product.
• R is not used to conduct some/all of the data transformations, summaries, and/or analysis.

Format of Assignment 2

Structure of your Assignment answers

Working in R can be challenging, both for you as the student and for the graders trying to interpret your assignment. The following ‘rules’ are set up to help provide consistency/structure across Assignments, and to promote efficiency (i.e. for faster grading). If you don’t understand any of these ‘rules’, ask for clarification! Please follow these guidelines:

✓ Your Assignment should be created in an R markdown file which you will knit to a single .PDF file. Consequently ALL of your answers to the Assignment will be in the R markdown file and resulting knitted file. You may name the file whatever you want, but it’s a good idea to keep the filename fairly simple.

✓ ALL of your code (including any libraries/packages you have loaded, your import of the datafile, etc.) needs to show in your knitted .PDF file. Be aware; if you see “include = FALSE” at the start of an R chunk, switch it to “include = TRUE”.

✓ Make sure that long lines of code are still visible in your knitted PDF file—we need to see your full code so we can evaluate it properly. If your code is running off the page in your knitted file, review the information/suggestions described in Lab 3 / Lesson 3 / Part 2; under that part, there is a subtopic dealing with long # comments and code that runs off the knitted page.

✓ Your answers to Question 1 and Question 2 should be labelled as such AND label parts a, b, c, etc. of your answers, with Question 2 starting on a separate page from Question 1. To make a new page in an R markdown file, leave a blank line in the text section (i.e. not an R chunk). On the next line, write only the following: \newpage Note carefully: the grader reading your answer to Question 1 will not read your answer to Question 2, and vice versa. Do NOT make any assumptions that the grader is familiar with your answer to the other question OR any R code/output you used in the other question. If you want to reuse code/output, then copy it again into the relevant Question.

✓ Use separate R chunks in the .RMD file for each Question (you can use more than one R chunk per Question if it helps with your organization, but definitely have separate R chunks for Question 1 vs. Question 2). Label your questions clearly.

✓ Do not ‘print out’ the dataset or variables in your knitted file (i.e. don’t make it so that the raw data shows in your knitted file). Essentially, your resulting file for submission should only be a couple pages long! That means, if you use View() for example, remove that from your R Markdown file before you knit it. As well, your code used to import the datafile in the R markdown file should simultaneously ‘save’ it to a dataframe, i.e. a2 <- read.csv(file = "assignment.csv", header = TRUE)

✓ Consider using short #comments in your R chunks to help clarify what you are trying to achieve with your lines of code; this is also good form so that if you look back at your Assignment in the future, you know what your code was doing.

✓ Do NOT include the original Assignment questions in your answer file; just have your answer text in your file. Your Assignment will be submitted to Turnitin; having the Assignment question text in your submission will inflate the observed textual similarity.

✓ In Questions that involve written responses, do your best to use proper grammar and spelling. Problems with sentence structure and grammar that significantly detract from the readability of your answer (and therefore, our understanding of what you are saying) may result in lower rubric levels.

✓ Write your answer in your own words; your Assignment answers should represent your independent thinking. This Assignment (and the other three Assignments in the course) are individual Assignments. Evidence of inappropriate collaboration in your Assignment answer will be dealt with according to the University’s procedures for issues relating to Academic Integrity. Your Assignment will be submitted to Turnitin for analysis of textual similarity to other sources.

Submission of your Assignment

You will need to submit your Assignment in TWO (2) places by the deadline:

1. OWL Assignments, to the same Assignment where you accessed this instruction file. You should submit both your R markdown file (.RMD) AND the resulting knitted PDF file.

2. Gradescope. On Gradescope, submit the resulting knitted PDF file to the assignment called “Assignment 2 – Data”. When you submit your Assignment to Gradescope, you will need to ‘assign’ the page(s) of your Assignment file that are relevant to Question 1 and to Question 2. If you are uncertain how to do this, watch the video at: https://www.youtube.com/watch?v=u-pK4GzpId0&feature=emb_logo. The relevant information begins at approximately timepoint 1:40.