Biostatistics Final Project
THE UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT HOUSTON
SCHOOL OF PUBLIC HEALTH: PH1690 Foundations of Biostatistics
Individual Data Analysis Project
Your overall topic is assessing differences in protein intake between subjects with high and normal total cholesterol, differences in total fat intake by how subjects perceive the healthiness of their diet, and the association between total fat intake and total cholesterol level.
We expect you to analyze the data and write the brief report on your own; we will evaluate and grade the report you submit. This is an individual project, and it is NOT a collaborative exercise. In other words, you must work on your own on this project and you are not allowed to share your project, results, electronic files, or documentation with anyone. All analyses are to be done in Stata. Your written report should be no longer than 4 pages. Any relevant tables or graphs should be included in an Appendix and referenced in the written portion (there is no page limit for the Appendix).
The data that you will use for this project are a simple random sample of n = 3,500 drawn from the N = 4,301 who participated in this survey. You will use the following Stata code to select the observations that you will use for your analyses:
// Select a random sample of 3,500
set seed ######
sample 3500, count
The random number seed is your student ID number (digits only, no other characters). It is unique to you. It ensures that you have a unique set of data for your project and allows us to verify that you used the data assigned to you. Because your dataset is drawn from the original data, the associations you obtain among the independent and dependent variables will differ in magnitude and level of statistical significance from those of your fellow students and from those in the original data.
Your report will have 4 sections: Introduction, Methods, Results, and Discussion.
Introduction: This section should answer the question, “What is the rationale for the scientific question asked?” The rationale needs to be based on the significance of dietary behavior and serum cholesterol levels to public health. Describe the relationships of interest and the purpose of the analysis. Please conduct a small PubMed literature search (1-3 studies) to support the scientific question asked in the project and provide a brief summary of the issues in this section.
Methods: This section should describe the steps and statistical methods you used to analyze the data and how you applied them to answer the questions asked. You need to describe which statistical methods were used and the rationale or purpose for each. Please describe any statistical methods used to test the assumptions of a test, if needed. If you created new variables for your analyses, you need to provide the rationale for creating each variable and describe the method you used to create it. Add a sentence referencing the software you used for all your analyses, in this case Stata, just as you are expected to do for any peer-reviewed publication.
Results: The results section needs to mimic a peer-reviewed publication, so it needs to include the following elements:
Part I: Summary Statistics. Describe the composition of your sample in terms of its sociodemographic make-up.
a) Identify the variables used in the comparison and create a summary table that describes your sample. Use the Table 1 Template in Appendix A to present your descriptive statistics.
· You will need to generate a new categorical variable high_chol coded as follows: high total cholesterol (lbxtc >= 240 mg/dL) = 2, borderline high total cholesterol (200 <= lbxtc < 240 mg/dL) = 1, and desirable total cholesterol (lbxtc < 200 mg/dL) = 0.
· You will also generate new categorical variables as needed so that they match the categories of the variable in the given table template (for example: Education)
· Determine summary measures for each of the variables in the dataset using measures of central tendency and dispersion. That is: for categorical variables, report the sample size and proportion for each response category; for continuous variables, report the mean and standard deviation if the variable is normally distributed, or the median and interquartile range if it is not. Indicate in the table which measures you computed.
· For each variable, compute the appropriate parametric or nonparametric test to assess the simple association between that independent variable and total cholesterol category (high, borderline high, and desirable).
· Describe the statistics you used above in the Methods section and report the P value in the table.
· Provide the P value for the overall chi-square or Fisher’s exact test on the line with the category label.
b) Summarize your findings based on the initial descriptive statistics in a brief paragraph.
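As an illustration only, the steps in item a) might be sketched in Stata as shown below. Only lbxtc and high_chol are named in this handout; gender and age are hypothetical variable names, so substitute the names in your dataset. The sketch also assumes that a value of exactly 200 mg/dL is coded as borderline high.

```stata
* Illustrative sketch; lbxtc and high_chol are named in the handout,
* gender and age are hypothetical variable names.
generate byte high_chol = .
replace high_chol = 0 if lbxtc < 200
replace high_chol = 1 if lbxtc >= 200 & lbxtc < 240
replace high_chol = 2 if lbxtc >= 240 & !missing(lbxtc)
label define high_chol_lbl 0 "Desirable" 1 "Borderline high" 2 "High"
label values high_chol high_chol_lbl

* Confirm that each category covers the intended lbxtc range
tabstat lbxtc, by(high_chol) statistics(n min max)

* Categorical variable: counts, column percentages, chi-square test
* (the exact option requests Fisher's exact test when expected counts are small)
tabulate gender high_chol, column chi2 expected

* Continuous variable: means and SDs by group with one-way ANOVA,
* or the Kruskal-Wallis test if the variable is not normally distributed
oneway age high_chol, tabulate
kwallis age, by(high_chol)
```

The `!missing(lbxtc)` condition matters because Stata treats missing values as larger than any number, so without it subjects with missing cholesterol would be coded as high.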
Part II: Hypothesis testing. Evaluate whether a) mean total cholesterol level, b) dietary protein intake, and c) total fat intake differ by perceived healthiness of diet (using the variable “how healthy is your diet”).
· For perceived healthiness, recode the responses to the question “how healthy is your diet” into 3 categories: Poor/Fair = 1, Good = 2, and Very Good/Excellent = 3.
· Include appropriate graphs for each comparison (include 3 graphs in appendix)
· Assuming that the data are normally distributed, test these hypotheses. Report the p-value for the overall test.
· If the overall test is statistically significant use an appropriate pairwise multiple comparisons procedure to test for differences in mean total cholesterol levels to identify specific differences between levels of perceived diet.
· Summarize your findings in two or three paragraphs.
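One possible way to carry out the recoding, graph, and overall test for the total cholesterol comparison is sketched below. The variable name diet_health and its assumed coding (1 = Excellent through 5 = Poor) are hypothetical; check your codebook before recoding. The name diet3 is likewise only illustrative.

```stata
* diet_health is a hypothetical name for the "how healthy is your diet" item,
* assumed to be coded 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor.
recode diet_health (4 5 = 1 "Poor/Fair") (3 = 2 "Good") ///
    (1 2 = 3 "Very Good/Excellent") (else = .), generate(diet3)

* Cross-check the new categories against the original responses
tabulate diet_health diet3, missing

* Graph for the appendix
graph box lbxtc, over(diet3) ytitle("Total cholesterol (mg/dL)")

* One-way ANOVA with Bonferroni-adjusted pairwise comparisons
oneway lbxtc diet3, tabulate bonferroni
```

The same pattern applies to protein and total fat intake by replacing lbxtc with the corresponding variable. If you prefer Tukey's procedure for the pairwise comparisons, `pwmean lbxtc, over(diet3) mcompare(tukey) effects` is one alternative.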
Part III: Linear Regression. Assess the hypothesis that the total cholesterol level is associated with protein intake and total fat.
· Present your initial exploratory analyses on the data that you used to make a preliminary assessment on the presence of potential outliers and distributional characteristics relevant to the statistical model needed to address the primary hypotheses you are asked to evaluate.
· Describe how you dealt with violations of the assumptions (selecting an appropriate transformation if applicable)
· Fit the regression model.
· Copy and paste the regression results for each of your final models into your report and label the tables (Table 2, Table 3, etc.).
· Evaluate model fit by examining the residuals. Interpret your findings. Include the graphs as separate figures in the Appendix.
· Summarize the key results from your regression model and your assessment of model fit in the body of the paper.
Discussion: In this section you need to describe what the results mean in the context of the scientific question, integrating all the questions asked for the project.
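The Part III workflow could look roughly like the following sketch. The variable names protein and totfat are hypothetical stand-ins for the protein and total fat intake variables in your dataset, and resid is an illustrative name for the stored residuals.

```stata
* protein and totfat are hypothetical variable names; lbxtc is from the handout.

* Exploratory analyses: distributions and potential outliers
summarize lbxtc protein totfat, detail
histogram lbxtc, normal
graph matrix lbxtc protein totfat, half

* Fit the linear regression model
regress lbxtc protein totfat

* Residual diagnostics (run immediately after regress)
rvfplot, yline(0)
predict double resid, residuals
qnorm resid
estat hettest
estat vif
```

If the exploratory plots suggest a skewed outcome, one option is to create a transformed variable (for example `generate double ln_lbxtc = ln(lbxtc)`), refit the model, and repeat the diagnostics. Whether a transformation is needed depends on your own sample.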
Direct any questions about the project to your instructor or the TAs assigned to your class.
SUBMISSION
Once you have completed all your analyses and written your report, you will upload the report and a do file that documents all your analyses and analytic decisions to Canvas. For simplicity (and legibility), please use either Times New Roman, Calibri or Arial font with a minimum font size of 11 and have a minimum 1.5 spacing. You will submit either a .doc or .pdf file.
1. Save your answers under the name LastName_FirstName_ProjectReport.docx.
2. Save your do file for Part 1 under LastName_FirstName_Project.do
3. Upload your completed assignment and do file no later than 11:59 PM (CDT), Friday, December 11, 2020.
PROBLEMS AND GRADING: The following table outlines the content of your report and the points assigned to each section.
| Section on Report | Points |
| --- | --- |
| Introduction | 5 |
| Methods | 10 |
| Results (Tables and Graphs with supporting text) | |
| Part I. Summary Statistics (Table and Text) | 20 |
| Part II. Association between healthiness of diet (variable high_chol) including a discussion on choosing the proper test | |
| Gender | 6 |
| Race | 6 |
| Education | 6 |
| How Healthy is your diet? | 6 |
| Have you heard of Food/My Pyramid | 6 |
| Part III. Association between total cholesterol and dietary protein and total fat intake | |
| Describe Exploratory Analyses and Graphics | 5 |
| Discussion of Issues and Transformations | 5 |
| Regression model | 10 |
| Regression Diagnostics (Graphs and Written Analysis) | 10 |
| Discussion | 5 |
| Total | 100 |
Appendix A. Model Table 1. Descriptive Statistics
| | Desirable Total Cholesterol (lbxtc < 200 mg/dL, N = ) | Borderline high Total Cholesterol (200 ≤ lbxtc < 240 mg/dL, N = ) | High Total Cholesterol (lbxtc ≥ 240 mg/dL, N = ) |
| --- | --- | --- | --- |
| Age (in years, M (SD)) | Descriptive statistics of variable age for patients with desirable total cholesterol | | |
| Gender | | | |
| Male (% (n)) | Number of male patients with desirable total cholesterol | | |
| Female (% (n)) | | | |
| Race | | | |
| Hispanic White (% (n)) | | | |
| Non-Hispanic White (% (n)) | | | |
| Non-Hispanic Black (% (n)) | | | |
| Other (% (n)) | | | |
| Education | | | |
| Less than High School (% (n)) | | | |
| High School Grad (% (n)) | | | |
| College (% (n)) | | | |
| How Healthy is your diet? | | | |
| Excellent/V.good (% (n)) | | | |
| Good (% (n)) | | | |
| Fair/Poor (% (n)) | | | |
| Have you heard of Food/My Pyramid | | | |
| Yes (% (n)) | | | |
| No (% (n)) | | | |
| Total # of Dietary Supplements Taken in past 30 days (M (SD)) | | | |
| Total energy (kcal) intake in 24 hrs (M (SD)) | | | |
| Total protein (gm) intake in 24 hrs (M (SD)) | | | |
| Total carbohydrate (gm) intake in 24 hrs (M (SD)) | | | |
| Total sugar (gm) intake in 24 hrs (M (SD)) | | | |
| Total dietary fiber (gm) intake in 24 hrs (M (SD)) | | | |
| Total fat (gm) intake in 24 hrs (M (SD)) | | | |
**In the green cells, you just need to provide the corresponding number of patients. In other words, these cells should be filled in as a contingency table.
Model Table 2. Association between
| | P-value |
| --- | --- |
| Gender | 0.xxx^a |
| Race | 0.xxx^b |
| Education | |
| How Healthy is your diet? | |
| Have you heard of Food/My Pyramid | |
a- P-values from --- test
b- P-values from chi2 test
c- P-values from --- test