Outline
Biostatistics
DH 242 Dental Public Health
RESEARCH
SCIENTIFIC METHOD: a series of logical steps starting with the formulation of a problem
Formulation of a problem (question)
Formulation of a hypothesis (a proposed answer to the question)
Collecting data (existing as well as gathering your own)
Analysis and interpretation of the results
Presentation of the results
Formulation of a conclusion (relationship of results to hypothesis)
Data
pieces of information
e.g., numbers, collected from measurements and counts obtained during the course of a research study
Collecting Data
Review relevant available literature.
Design the research: determine how the study will be conducted and how the data will be collected using various data collection methods.
Instruments and examiners for data collection must be calibrated so that the data are both valid and reliable.
Validity: concerned with whether the data gathered are the data that were intended to be collected. Reliability: refers to the consistency and stability of the data. The data are reliable if the examiners are calibrated and can reproduce the results.
STATISTICS AND BIOSTATISTICS
Statistics is the science of making statements about an entire population from a limited sample of that population. It involves analyzing data and drawing conclusions, taking variation and uncertainty into account.
Biostatistics is simply the application of these methods in biologically relevant areas. The appropriate use and interpretation of biostatistical measures and tests are essential to every stage of a dental public health initiative. To define a problem in a community, you first must quantify it using descriptive statistics and measures of disease.
Biostatistics
The use of data analysis and interpretation in health care research
Most often computer programs are used to do all of the computations.
Data Analysis: Two Steps
The first is to calculate descriptive statistics, the characteristics of the data found within the sample of individuals in whom the study was conducted.
The second step is to calculate inferential statistics. The purpose of generating inferential statistics is to determine whether the results found in the sample may be a result of chance or, assuming no other threats to validity, whether we can generalize our results to the general population of interest.
Biostatistics – Data Analysis
Involves the application of statistical tests to the data in order to organize, describe, summarize, and analyze it to answer a research question or test a hypothesis
Explains results, requires that critical thinking be used to explain the meaning and application of the findings, identifies possible factors that could have influenced the results, and draws inferences to the population.
Use of Biostatistics in Dental Hygiene
Used to demonstrate response to dental hygiene therapy
Tests products and treatment regimens used in dental hygiene therapy
Determines the needs of target populations
Evaluates oral health treatment
Prevention of dental disease
Education programs
Variety of other purposes in relation to oral health care
Dental Hygienists Should Understand the Research Process, Including
Data analysis
Interpretation
Critical analysis of results
Dental Hygienists Should Understand the Research Process In Order To:
Understand the epidemiology of disease.
Practice therapies.
Implement programs.
Practice evidence-based dentistry.
Causes of Invalid Research
Insufficient number of subjects
Study duration is too short
Incorrect measurement instruments
Incorrect procedures are utilized
Incorrect statistical tests are used to analyze data
Categorizing Data
Quantitative Data
Qualitative Data
Continuous Variable
Discrete Variable
Categorical Variable
Dichotomous Variable
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Quantitative Data
Represented by numbers
Expressed as counts, percentages, and means
Qualitative Data
Information that reflects the quality or nature of variables that cannot be expressed numerically
Expressed as outcomes, or states, and can be counted for reporting
Variables can be rank ordered
Types of Data
Ways to categorize data
Continuous (Data) Variable
Not limited to distinct and separate units; can take any value within a range
Expressed as a large or infinite number of measures along a continuum
Can be expressed in fractions and is considered quantitative
Can be converted into nominal or ordinal scales
Discrete (Data) Variable
Made up of distinct and separate units or categories, but is counted only in whole numbers
Also quantitative because it is represented numerically
Can be converted to nominal or ordinal scales
Categorical (Data) Variable
A variable that has no numeric representation
Dichotomous (Data) Variable
Categorical variable that places subjects into only two groups
Categorical and dichotomous variables are qualitative in nature
Scales of Measurement
Nominal Scale
Organizes data into mutually exclusive categories
Categories have no rank order or value
No numeric relationship between the different classifications
Ordinal Scale
Organizes data into mutually exclusive categories that are rank ordered based on criterion
Difference in ranks is not equal in value
Interval Scale
Characteristics of the ordinal scale plus it has equal distance between any two adjacent units of measurement
No meaningful zero point
Ratio Scale
Scale of measurement that contains all the characteristics of the preceding scales
Has an absolute zero point
Scales of Measurement
Different scales of measurement are used for discrete and continuous data.
Discrete: use nominal and ordinal
Continuous: use interval and ratio
Degrees of Freedom
Also known as “df”
Refers to the number of values or observations that are free to vary when computing a statistic
Represents the number of measurements taken, minus one for each group sampled
The number is necessary to interpret inferential statistical tests.
It is based on sample size, so the larger the df, the easier it is to obtain a statistically significant result.
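The "minus one for each group" rule can be sketched in a few lines. This is a minimal illustration only; the function names are hypothetical and the group sizes are made up.

```python
def df_one_sample(n: int) -> int:
    """Degrees of freedom for a single group of n measurements: n - 1."""
    return n - 1


def df_independent_groups(*group_sizes: int) -> int:
    """Degrees of freedom for independent groups: subtract one per group."""
    return sum(size - 1 for size in group_sizes)


print(df_one_sample(30))               # 29
print(df_independent_groups(30, 30))   # 58
```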
DATA ANALYSIS AND PRESENTATION OF RESULTS
Statistical analysis makes an assumption about a population.
Two types:
Descriptive Statistics
Consists of the procedures that are used to summarize, organize, and describe quantitative data
Described with the use of tables and graphs
Inferential Statistics
Used to make inferences or generalizations about a population based on data taken from a sample of that population
“Making statistical decisions”
INFERENTIAL STATISTICS
Seek to determine a generalization between the sample studied and the actual population
May be either parametric or nonparametric statistical techniques
Based on the assumption that sampling is randomly collected
INFERENTIAL STATISTICS
STATISTICAL SIGNIFICANCE
Indicates whether the results found in an analysis of data are likely to have occurred by chance or are likely to reflect the effect of the independent variable.
May be influenced by a sample size that is too small (fewer than 30); in that case, not enough information has been provided to make generalizations about the population.
Confidence Intervals
A confidence interval is a statistical technique used to infer the true value of an unknown population parameter.
Typically, 95% and 99% confidence intervals are used.
The use of a 95% confidence interval is acceptable in oral health research.
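A confidence interval for a mean can be sketched as follows. This is an illustration under stated assumptions: it uses the normal approximation (reasonable when the sample has about 30 or more subjects), the function name is hypothetical, and the plaque scores are made-up values.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(n)
    centre = mean(sample)
    return centre - margin, centre + margin


# Hypothetical plaque index scores for 30 patients (made-up values).
scores = [1.2, 1.5, 1.1, 1.8, 1.4, 1.6, 1.3, 1.7, 1.5, 1.4] * 3
low, high = mean_confidence_interval(scores)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Asking for 99% confidence instead of 95% widens the interval, because more certainty requires a broader range of plausible values.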
HYPOTHESIS TESTING
The second approach to statistical inference is hypothesis testing.
The goal of hypothesis testing is to judge the evidence for a hypothesis.
Hypothesis testing can be divided into four discrete steps: (i) formally stating the null and alternative hypotheses; (ii) choosing an appropriate statistical test; (iii) conducting the statistical test to obtain a p-value; and (iv) comparing the p-value against a fixed cutoff for statistical significance, α (alpha).
Typically, this value is set to 0.05. Essential to the concept of hypothesis testing is the p-value. The objective of hypothesis testing is to formally weigh the evidence against a null hypothesis.
P-value
The p-value is a probability value.
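It represents the probability of obtaining findings at least as extreme as those observed in the study if the null hypothesis is true (that is, by chance alone).
The significance level commonly accepted in oral health research is a p-value equal to or smaller than 0.05.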
If larger than 0.05, the results are said to be not statistically significant.
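One way to see what a p-value measures is an exact permutation test, sketched below. This technique is swapped in for illustration and is not named in these notes. It reshuffles the group labels in every possible way and counts how often chance alone produces a difference at least as large as the one observed. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    total = 0
    as_extreme = 0
    for chosen in combinations(indices, len(group_a)):
        picked = set(chosen)
        a = [pooled[i] for i in indices if i in picked]
        b = [pooled[i] for i in indices if i not in picked]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Made-up bleeding-site counts for two small groups.
p = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
print(round(p, 4))               # 0.0286
print("significant at 0.05:", p <= 0.05)
```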
Power Analysis
Used to determine how many subjects are needed to detect a statistically significant effect in a research study
Calculated by using a statistical formula
“Power of a study” refers to its ability to detect relationships among variables
Is directly related to the sample size and the precision in planning and conducting the study
Importance of statistical significance: the greater the significance, the more statistical inference can be made regarding the study population
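The notes do not say which formula is used for a power analysis. One common textbook approximation for comparing two group means is sketched below. Its assumptions: a two-sided test, equal group sizes, the normal approximation, and an effect size expressed in standard deviation units. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for a two-sided test
    z_power = z.inv_cdf(power)          # cutoff for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


print(subjects_per_group(0.5))  # 63
```

Smaller effects need many more subjects: calling the function with an effect size of 0.2 gives a far larger number than with 0.5.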
Formulation of a Conclusion/Relationship of Results to the Hypothesis
Determination of whether the study shows significance
The researcher will either accept or reject the null hypothesis. He/she may make an error in this conclusion. There are two types of errors.
Type I, alpha (α): occurs when the researcher rejects the null hypothesis when it is actually true; the researcher states that a relationship exists between the variables when there is none.
Type II, beta (β): occurs when the researcher accepts the null hypothesis when it is actually false; the researcher states that no relationship exists between the variables when one actually does.
Null Hypothesis
Is an initial negative statement of belief about the value of a population parameter
Null hypothesis is accepted unless the statistical test indicates it should be rejected
Example:
Two groups do not differ on a variable
Research Hypothesis
Called the alternative or positive hypothesis
Is the logical opposite of the null hypothesis and can indicate a direction of difference
Example:
One brand of sealants does differ from another brand of sealants
Statistical Decision
Made about the null hypothesis based on the results of inferential statistics
Decision to reject or accept the null hypothesis is based on probability at a significance level
Type I Error
A type I error is also called an alpha error.
It occurs when the null hypothesis is rejected even though it is actually true, so it should not have been rejected.
The probability of committing a type I error is equal to the alpha level.
Researchers can control a type I error by setting the alpha level low.
This type of error can be very costly.
Type II Error
A type II error is also called a beta error.
It occurs when the null hypothesis is accepted, but it is actually false, so it should have been rejected.
The exact probability of committing a type II error is generally unknown.
Type II errors are often caused by using too small a sample, unreliable measuring devices, or imprecise research methods.
Parametric Inferential Statistics
Used for hypothesis testing when the data meet certain assumptions
Data must be classified as continuous (ratio or interval data, or ordinal data treated as continuous)
Types of parametric statistics:
Student t-test
Analysis of variance (ANOVA)
Student t-test
Used to compare two mean scores to determine if there is a statistically significant difference
Two types of t-tests:
T-test for independent samples (nonpaired t-test)
T-test for correlated samples (t-test for paired samples)
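The two t-tests can be sketched with the standard library. This is an illustration only: the function names are hypothetical, the scores are made up, and the independent-samples version assumes equal variances (pooled variance). Converting the t statistic into a p-value needs a t table or a statistics package, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(group_a, group_b):
    """Student t statistic and df for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def paired_t(first, second):
    """Student t statistic and df for paired (correlated) samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up scores, for illustration only.
t_ind, df_ind = independent_t([2, 4, 6], [1, 2, 3])
t_pair, df_pair = paired_t([3, 4, 5], [1, 3, 2])
print(round(t_ind, 3), df_ind)    # 1.549 4
print(round(t_pair, 3), df_pair)  # 3.464 2
```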
ANOVA
Used to determine if statistically significant differences occur when comparing more than two mean scores
Data are presented in complex tables
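A one-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes the F ratio by hand. The function name is hypothetical and the three groups are made-up scores.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F ratio and degrees of freedom for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Three made-up treatment groups.
f, df_b, df_w = one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f, df_b, df_w)  # 3.0 2 6
```

The F ratio is then compared with a critical value for the two df values to decide whether the group means differ significantly.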
Nonparametric Inferential Statistics
Most useful for data measured at the nominal or ordinal scale, which are qualitative
Nonparametric tests involve fewer assumptions about the population.
Sample size may be small; variables are discrete
Chi-Squared Test: most commonly used; used to analyze questionnaire data and to determine whether a relationship exists between two variables
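The chi-squared statistic compares the observed counts in a table with the counts expected if the two variables were unrelated. A minimal sketch follows, with a hypothetical function name and made-up questionnaire counts.

```python
def chi_squared(table):
    """Chi-squared statistic and df for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Made-up counts: rows = two clinics, columns = flosses daily (yes, no).
stat, df = chi_squared([[30, 10], [20, 20]])
print(round(stat, 3), df)  # 5.333 1
```

With 1 degree of freedom the critical value at alpha = 0.05 is 3.841, so a statistic of 5.333 would be judged statistically significant.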
Measures of Central Tendency
Mean: average; used with continuous data (ratio or interval, or ordinal treated as continuous)
Median: midpoint of the data; used with ratio, interval, or ordinal data
Mode: value that occurs most often; used with all types of data
MEASURES OF CENTRAL TENDENCY
MEAN
The average of the group; sum of all the values divided by the number (n) of items
Disadvantage: extreme scores may distort the true average or representation
MEDIAN
The exact middle score in an ordered distribution of scores; the point above and below which 50% of the scores lie
Disadvantage: may not reflect a true midpoint if scores are not evenly distributed
MODE
The score that appears most frequently in a distribution of scores; may be unimodal, bimodal, multimodal, or no mode
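The three measures can be computed with Python's standard `statistics` module. The scores are made up and include one extreme value to show how it pulls the mean but not the median or mode.

```python
from statistics import mean, median, multimode

# Made-up scores with one extreme value (20).
scores = [2, 3, 3, 4, 5, 5, 5, 6, 20]

print(round(mean(scores), 2))  # 5.89 (pulled upward by the extreme score)
print(median(scores))          # 5
print(multimode(scores))       # [5] (a list, so bimodal data show both modes)
```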
Measures of Dispersion
Communicates how much variation is present in a group of data
“Measure of variability”
Describes distribution of data within a research study
Range
Variance
Standard deviation
Measures of Dispersion – Range
Determined by subtracting the lowest score from the highest score
Simplest and least helpful measurement
Usually reported with the median
Measures of Dispersion – Variance
Variance represents the average squared distance of each score from the mean
Standard deviation (SD) is the square root of the variance
The most common and useful measures of dispersion
Usually reported with the mean to calculate data intervals
Value of the variance or the SD in relation to the mean depicts the distribution of scores
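Range, variance, and standard deviation can be sketched on a small made-up data set. The population versions (`pvariance`, `pstdev`) divide by n. The sample versions (`variance`, `stdev`) divide by n - 1 and are the ones normally used when the data are a sample.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores with a mean of 5

data_range = max(scores) - min(scores)
print(data_range)                   # 7
print(pvariance(scores))            # 4
print(pstdev(scores))               # 2.0
print(round(variance(scores), 3))   # 4.571
print(round(stdev(scores), 3))      # 2.138
```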
The Normal Distribution
Forms the theoretical foundation for comparisons and making statistical decisions
A symmetrical, unimodal, bell-shaped curve
Many random variables tend to be approximately normally distributed
Mean, median, and mode equal in value
The Normal Distribution – Empirical Rule
Provides an estimation of the spread of data given the mean and the standard deviation of a data set that follows a normal distribution
68% of data fall within one SD of the mean, 95% within two SD, and 99.7% within three SD
FIGURE 18-1 Standard Normal Distribution
FIGURE 18-2 Empirical Rule of the Normal Distribution Source: Darby ML, Bowen DM. Research Methods for Oral Health Professionals: An Introduction. St. Louis, MO: C. V. Mosby, 1980. Reprint. Pocatello, ID: McCann; 1993.
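The percentages in the empirical rule can be checked against the normal curve itself. A short sketch using the standard library's `NormalDist` follows; the function name is hypothetical.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.1%}")
```

The same proportions hold for any mean and SD, for example `share_within(1, mu=50, sigma=10)`.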
The Normal Distribution - Central Limit Theorem
The central limit theorem states that the distribution of sample means approaches a normal distribution as sample size increases.
Less sampling error will occur with a larger sample, and a sample size of 30 or more will estimate the population mean with reasonable accuracy.
The Normal Distribution – Standard Error of the Mean
The standard deviation of the sample means
Indicates that each sample mean is likely to vary somewhat from the population mean
Larger sample size significantly reduces the standard error
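The standard error of the mean is commonly estimated as the sample SD divided by the square root of the sample size. A small sketch with a hypothetical function name and made-up numbers:

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / sqrt(n)


print(standard_error(10, 25))   # 2.0
print(standard_error(10, 100))  # 1.0
```

Quadrupling the sample size halves the standard error, which is why larger samples estimate the population mean more closely.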
Skewed Distribution
When a distribution of scores is asymmetrical, the curve is said to be distorted or skewed.
Skewing is caused by a few extreme scores in the distribution.
It can be identified by comparing the mean and median of the distribution.
Positively or negatively skewed
Median and mode more accurately represent central tendency in a skewed distribution
May result from using small or homogeneous samples, or failing to use random sampling or random assignment techniques
FIGURE 18-3 Skewed Distributions
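Comparing the mean and the median gives a rough check for skew, sketched below. This is a simple heuristic rather than a formal test; the function name and scores are made up.

```python
from statistics import mean, median


def skew_direction(scores):
    """Rough skew check: the mean is pulled toward the extreme scores."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"
    if m < md:
        return "negative"
    return "none detected"


print(skew_direction([1, 2, 2, 3, 3, 3, 40]))       # positive
print(skew_direction([1, 38, 39, 40, 40, 41, 41]))  # negative
print(skew_direction([1, 2, 3, 4, 5]))              # none detected
```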
Advantages of Graphing Data
Effective and economic communication of data
Easier and quicker understanding and interpretation of data
The ability to compare multiple distributions visually
Frequency Distribution Tables
Frequency distribution tables are used to present data in a way that shows the number of times each score occurs in the group of scores.
Distribution tables can be either grouped or ungrouped.
Data can be displayed in a graph
Facilitates our understanding and interpretation of data
Data presented should be understandable even without written explanation
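Ungrouped and grouped frequency distributions can be sketched with `collections.Counter`. The scores are made up, and the class width of 3 is an arbitrary choice for illustration.

```python
from collections import Counter

# Made-up counts of decayed teeth for ten patients.
scores = [0, 1, 1, 2, 2, 2, 3, 3, 5, 8]

# Ungrouped: how many times each individual score occurs.
ungrouped = Counter(scores)

# Grouped: scores collected into classes 0-2, 3-5, 6-8 (width 3).
grouped = Counter((s // 3) * 3 for s in scores)

for score in sorted(ungrouped):
    print(score, ungrouped[score])
for start in sorted(grouped):
    print(f"{start}-{start + 2}", grouped[start])
```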
Types of Graphs
Bar graph
Histogram
Frequency polygon
Polygon
Scattergram
Pie chart
Bar Graph
Used to represent categorical data
Spaces separate bars to emphasize the discrete nature of the variable
Length of the bar corresponds with the frequency of the value
Cluster bar graph can also be created
FIGURE 18-4 Bar Graph of Reasons for Missed Clinic Appointments
FIGURE 18-5 Cluster Bar Graph
Histogram
Similar to a bar graph but the bars appear side by side (touching)
Used for interval or ratio variables
Used to represent grouped and ungrouped frequencies
Used for ordinal data that is treated as continuous data
FIGURE 18-6 Histogram
Frequency Polygon
A line graph that represents frequency data that are continuous in nature
Drawn by connecting midpoints of the bars of a histogram, then extending the line at both ends to imaginary midpoints at the right and left of the histogram
Used to represent grouped or ungrouped frequencies
Can also represent frequency, percent, cumulative frequency, or cumulative percent
FIGURE 18-7 Frequency Polygon Comparing Two Distributions
Polygon
Line graph
Used to plot a variable over time
FIGURE 18-8 Polygon
Scattergram
Shows the relationship between two variables
Shows how the level of one variable varies as the level of the other variable changes
FIGURE 18-9 Scattergrams Demonstrating Relationships of Data
Pie Chart
Represents parts of a whole
More acceptable with lay audiences than in scientific or technical publications and presentations
Percentage represented by each part of the pie should be labeled for clarity
Correlation
Correlation studies relationships between variables.
The term means relationship or association between variables that can be measured mathematically.
The sign (+/-) indicates the direction of the relationship.
“r” signifies the correlation.
Value of “r” communicates the direction and strength of the association
Correlation does not equal causality
Provides much of the evidence in oral epidemiology
Establishes risk
Correlation Techniques
Pearson product-moment correlation coefficient
Spearman rank-order correlation coefficient
Regression analysis
Multiple regression analysis
Pearson Product-Moment Correlation Coefficient
Most common correlation coefficient
Used when both variables are continuous (interval or ratio scaled) and have a linear relationship
Spearman Rank-Order Correlation Coefficient
Also called “Spearman rho”
Used to correlate two ordinal variables
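Both coefficients can be sketched by hand. Spearman rho is computed here as the Pearson coefficient of the ranks, which is the usual definition. The function names are hypothetical and the data are made up.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]                      # rises with x, but not in a straight line
print(round(pearson_r(x, y), 3))           # 0.981
print(round(spearman_rho(x, y), 3))        # 1.0
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 3))  # -1.0
```

The rank order of y matches x exactly, so Spearman rho is 1.0 even though the relationship is not a straight line and Pearson r falls just short of 1.0.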
Regression Analysis
Can be used to quantify the relationship of two variables
Expresses the functional relationship between the variables
Used to predict the score of one variable based on the score of another
Example:
National board scores
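Simple regression fits a straight line so that one score can be predicted from another. The sketch below uses the least-squares formulas. The function name is hypothetical, and the data are made up and deliberately fall on a perfect line to keep the arithmetic easy to follow.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept


# Made-up predictor (x) and outcome (y) scores.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
slope, intercept = fit_line(x, y)
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # predicted y when x = 5: 11.0
```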
Multiple Regression Analysis
Provides a mathematical model that gives the strength or ability of two or more variables to predict another variable
Examples:
SAT scores
GPA strength
CORRELATION
A statistical method to determine certain relationships between variables
Results of a correlation show either a negative or a positive relationship
i.e., if positive, as the value of one variable increases, the other also increases.
Perfect positive correlation has a value of +1.0.
Negative correlation shows an inverse relationship between variables.
Perfect negative correlation is shown by -1.0.
No relationship at all is shown by 0.0.
The closer the relationship is to +1.0 or –1.0, the stronger the correlation
Source:
Mason, Jill. Concepts in Dental Public Health. Lippincott Williams & Wilkins, 2010.
Nathe, Christine Nielsen. Dental Public Health and Research: Contemporary Practice for the Dental Hygienist. Upper Saddle River, NJ: Pearson, 2011.