R language
INFO-B518-36882
Mid-Term Exam
October 23rd, 2020
Opens October 23rd at 3:30 pm; due October 24th at 3:30 pm
Total Points: 50
Images are from different sources
ALL QUESTIONS ARE COMPULSORY
Question 1: True or False [5 points]
(i) A Wilcoxon matched-pairs test is used when there are two matched pairs.
(ii) The paired t-test is best used when the measurement scale of the characteristic of interest is interval or ratio.
(iii) The median is not used in non-parametric tests.
(iv) If the variances of the two groups being compared are significantly different, then we will always use the independent samples t-test for pooled variances.
(v) A researcher is interested in the following question: Do children make more visits to a doctor’s office than adults? For this study, the researcher should always use two independent variables.
Question 2: Identify the test for each scenario below and explain why you would choose that test for the analysis: [10 points]
(a) A total of 90 women are enrolled in an exercise program. The women fall into two age groups: (i) 18 to 45 years and (ii) above 45 years. Which test will you use to analyze the weight loss of the women in the two age groups in this exercise program?
(b) A researcher is interested in studying the behavior of twins in a day care center. For each pair of twins in the program, one twin was given toys to play with and the other twin was given books to read. Which test would you use to understand the behavior of the twins with respect to calmness after one week in the program? There were 15 twins enrolled in the program.
(c) You are interested in understanding yearly patient visits to a dentist in a given dental practice. A total of 500 patients visit the dentist yearly, and each is enrolled in either Insurance A or Insurance B. Which test will you use to understand patient visits to the dentist? Assume the data are not normally distributed.
(d) In a hospital, the comorbidities of alcoholic and non-alcoholic patients were studied for one year. Which test would you use to determine whether alcoholic patients are more susceptible to comorbidities?
(e) A researcher is interested in knowing which parameters in COVID-19 are important for classifying patients with respect to disease severity. For this, the researcher downloads the patients’ lab values and demographic data. Which test should the researcher implement to find the parameters that associate strongly with COVID-19 disease severity?
Question 3: Define any FIVE of the following terms: [5 points]
(a) Big Data
(b) Retrospective Study
(c) Systematic Variation Study
(d) Stratified Sample
(e) Sampling with Replacement
(f) Skewness
(g) Addition Rule of Probability
(h) Sample Space
(i) Sampling Distribution
(j) Significance Level
Question 4: Solve any FIVE of the following questions ((a) to (g)): [10 points]
(a) For each of the following variables ((i) to (v)), select the correct measurement scale:
Nominal, Ratio, Interval, Ordinal
(i) Age in years
(ii) Birth Order
(iii) Marital Status
(iv) Number of years spent in College
(v) The number of miles joggers run per week
(b) Give one word for the following:
(i) A pair of variables that are related to each other is known as:
(ii) A study design that randomly assigns participants into an experimental group or a control group:
(iii) Two outcomes that cannot both happen together are known as:
(iv) To understand a sample with fewer than 30 observations, which table will you use:
(v) When you reject the null hypothesis even though it is true, this is known as:
(c) A student scores 25 on a test. For the class of 10 students, the mean test score was 21 and the standard deviation was 5. What was the student’s Z-score?
(d) The average age of patients in a clinical trial was 40 years. The standard error was 0.90 years. What is the approximate 95% confidence interval for the average age of the patients in the clinical trial?
(e) Given below are the summary statistics for the cost of a drug purchased by patients from two different stores. Set up and implement a hypothesis test to determine whether, on average, there is a difference between the drug prices at the two stores. Assume p-value < 0.05 when the Z-score is > 5. (Assume a paired test and that the actual mean under the null condition is 0.)
|  |  |  |  |
|  | 73 | 12.76 | 14.26 |
(f) Given below is the outcome of an exercise program.
|          |        | Outcome     |                | Total |
|          |        | Weight Loss | No-Weight Loss |       |
| Exercise | Female | 28          | 48             |       |
|          | Male   | 10          | 114            |       |
|          | Total  | 38          | 162            |       |
Use chi-square analysis to understand weight loss among males who joined the exercise program. Assume a critical value of 3.84.
(g) For each of the following figures ((i) to (vi)), select the figure type:
(Pearson Correlation Coefficient, bar plot, stem and leaf, Regression, histogram, Normal Distribution, Box plot)
(i)
(ii)
(iii)
(iv)
(vi)
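The arithmetic behind parts (c), (d), and (f) of Question 4 can be sketched in Python as shown below. This is an illustrative sketch only, not an official answer key. The function names are hypothetical. The 95% multiplier of 1.96 is an assumption (a rounder value of 2 is also commonly used for an "approximate" interval). Reading part (f) as a chi-square test of independence between sex and outcome on the 2x2 table is also an assumption.

```python
def z_score(x, mean, sd):
    """Standardize a single observation: (x - mean) / sd."""
    return (x - mean) / sd


def confidence_interval(mean, se, multiplier=1.96):
    """Normal-approximation interval: mean +/- multiplier * standard error.

    The default multiplier of 1.96 is an assumption for a 95% interval.
    """
    margin = multiplier * se
    return mean - margin, mean + margin


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Expected counts come from the row and column totals; no continuity
    correction is applied (an assumption of this sketch).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Part (c): score 25, class mean 21, standard deviation 5.
z = z_score(25, 21, 5)

# Part (d): mean age 40 years, standard error 0.90 years.
ci_low, ci_high = confidence_interval(40, 0.90)

# Part (f): rows are Female, Male; columns are Weight Loss, No-Weight Loss.
chi2 = chi_square_statistic([[28, 48], [10, 114]])
exceeds_critical = chi2 > 3.84  # critical value given in the question

print(f"z = {z:.2f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"chi-square = {chi2:.2f}, exceeds 3.84: {exceeds_critical}")
```

Comparing the computed statistic with the stated critical value of 3.84 (one degree of freedom) is what determines whether the null hypothesis of no association is rejected.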
Question 5: Solve both questions: [10 points]
A. A researcher wants to understand the depression scores reported by patients enrolled in two drug trials. The levels of depression were measured the next day and at midweek after the drug was taken. Which test should the researcher use to understand the effect of the drug across the days? (Use Data1.csv.)
B. (i) Given below is the study table compiled by a researcher for patients on a new drug. Compute the odds ratio for survival.
|          | Treatments |       | Odds Ratio |
|          | Drug1      | Drug2 |            |
| Survived | 250        | 100   |            |
| Died     | 150        | 20    |            |
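One way to set up the odds-ratio calculation for the table above is sketched here in Python. It is an illustration only, not an official answer. The function name `odds_ratio` is hypothetical, and taking Drug1 as the numerator group (Drug1 relative to Drug2) is an assumption, since the question does not say which drug is the reference.

```python
def odds_ratio(survived_a, died_a, survived_b, died_b):
    """Odds of survival in group A divided by odds of survival in group B."""
    odds_a = survived_a / died_a
    odds_b = survived_b / died_b
    return odds_a / odds_b


# Group A = Drug1 (250 survived, 150 died); group B = Drug2 (100 survived, 20 died).
# Choosing Drug1 as the numerator group is an assumption of this sketch.
or_drug1_vs_drug2 = odds_ratio(250, 150, 100, 20)

print(f"Odds ratio for survival (Drug1 vs Drug2) = {or_drug1_vs_drug2:.3f}")
```

Swapping the two groups gives the reciprocal of this value, so the direction of the comparison should be stated alongside the result.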
Question 6: [10 points]
Indiana State is interested in understanding COVID-19 patients with respect to severity, mortality, comorbidities, and parameters from March to September 2020. The State has access to patient data from different hospitals and testing centers. The data are arranged in four tables (Table 1, Table 2, Table 3, Table 4).
Assume you are a Lead Data Scientist for the State. The State Chief Medical Officer contacts you with a request to help with the analysis.
For this work you hire 3 Master’s students. With the help of these students, design a methodology that can help the State understand COVID-19 patients with regard to severity, mortality, comorbidities, and parameters.
Table 1: Patient ID, Gender, Birth Date, Race, Parents Alive, Siblings, Education, Income, Alcoholic/Non-alcoholic, Smoker/Non-Smoker, County, Children going to school, Home Zipcode, Date Tested for COVID, Survival
Table 2: Patient ID, Oxygen Level, Blood Pressure, Glucose, HbA1C, Basophil Count, Neutrophil Count, Monocyte Count, Albumin, CRP, Protein, Creatinine, eGFR, Pulse, Cholesterol, Weight, Height, Hgb, Lymphocyte Count, CO2 Level
Table 3: Patient ID, Type 2 Diabetes Diagnosed Date, Cancer Diagnosed Date - Cancer Name, Autoimmune Disease Diagnosed Date - Autoimmune Disease Name, Neurodegenerative Disease Diagnosed Date - Disease Name, COPD Diagnosed Date, Other Disease
Table 4: Patient ID, Restaurant Last Visited - Name/Zipcode of Restaurant, Living Near Highway, Work from Home, Going to Work - Zipcode, Stay at Home Order Date Start, Stay at Home Order Date End, Stage of Lockdown, Park Visited Day - Name/Zipcode of Park, Grocery Store Visited - Name/Zipcode, Gas Pump Visit Date - Zipcode
Describe the analysis process in detail. Refer to the flow diagrams taught in class for each test. There is no data to analyze.
HINT: This is an open-ended task. Design your question(s), hypothesis or hypotheses, data exploration/imputation, the tests to be carried out, and the conclusion.