Data preparation and cleansing was a multi-step process, requiring cleaning raw data to prepare it to be merged together, as well as initial exploration to identify the features to include and any data noise/missing data issues. Below are the steps I took on each individual set of data, as well as what was needed once the data were combined.
Step 1:
US Census Bureau: American Community Survey 5-year
Medicaid/Means-Tested Public Coverage by Sex by Age (Table C27007)
County-level annual data (5-year running average) for all 50 U.S. States, from 2012-2020
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | The table provided source and margin-of-error columns for each variable; these columns were dropped.
Rename Variables (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from “Estimate!!Total:!!Male:!!Under 19 years:” to Male Youth Pop)
Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population or subpopulation (e.g. MaleYouthMed_Rate = Male Youth Medicaid/Male Youth Pop)
Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Add Medicaid Expansion Status | Excel | One-to-Many merge of State/year of expansion to each County/State row
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Remove Duplicates | Excel | The Add Medicaid Expansion Status step created duplicate records for Arkansas and Kansas; these were removed
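The steps listed above were carried out in Excel. As a rough illustration only, the rate, index, expansion-status, and de-duplication logic could be sketched in plain Python as follows. The sample rows, the state name, the `EXPANSION_YEAR` lookup, and the helper names `make_key` and `prepare_year` are all hypothetical and are not taken from the source data.

```python
# Hypothetical sample rows for one survey year; real ACS extracts have many more columns.
rows_2014 = [
    {"FIPS": "01001", "State": "State A", "Male Youth Pop": 7000, "Male Youth Medicaid": 2100},
    {"FIPS": "01003", "State": "State A", "Male Youth Pop": 24000, "Male Youth Medicaid": 6000},
    # Exact duplicate of the previous record, to show the de-duplication step.
    {"FIPS": "01003", "State": "State A", "Male Youth Pop": 24000, "Male Youth Medicaid": 6000},
]

# Placeholder lookup (assumption): state -> first year of Medicaid expansion.
EXPANSION_YEAR = {"State A": 2014}


def make_key(fips, year):
    """Combine the 5-digit county FIPS code and the year into one row identifier."""
    return f"{str(fips).zfill(5)}-{year}"


def prepare_year(rows, year):
    """Add Year, ID-Year key, a rate column, and expansion status; drop duplicate keys."""
    prepared = {}
    for row in rows:
        out = dict(row)
        out["Year"] = year
        out["ID-Year"] = make_key(row["FIPS"], year)
        pop = row["Male Youth Pop"]
        # Guard against empty counties so the ratio never divides by zero.
        out["MaleYouthMed_Rate"] = row["Male Youth Medicaid"] / pop if pop else None
        # One-to-many lookup: every county row in a state gets that state's status.
        start = EXPANSION_YEAR.get(row["State"])
        out["Expanded"] = start is not None and year >= start
        # Keying by ID-Year keeps only the first copy of any duplicated record.
        prepared.setdefault(out["ID-Year"], out)
    return list(prepared.values())


# Union of annual tables: concatenate the prepared rows for each year.
combined = prepare_year(rows_2014, 2014) + prepare_year(rows_2014, 2015)
for row in combined:
    print(row["ID-Year"], round(row["MaleYouthMed_Rate"], 3), row["Expanded"])
```

Keying each row by ID-Year is one simple way to guarantee uniqueness; it silently keeps the first copy, so duplicates that differ in content would need to be inspected before being dropped.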
Health Insurance by Age and Race (Tables C27001B-E, H)
County-level annual data (5-year running average) for all 50 U.S. States from 2012-2020 for American Indian/Native, Asian, Black, Latino/Hispanic, Native Hawaiian/Pacific Islander, and White Non-Hispanic sub-populations. Excluded ‘some other race’ and ‘two or more races’; visual inspection suggested too few records.
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | The table provided source and margin-of-error columns for each variable; these columns were dropped.
Rename Variables (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from C27001B “Margin of Error!!Total:!!Under 19 years:” to Pop_Youth_Black)
Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population or subpopulation (e.g. YouthBlackIns_Rate = Youth_Black_Ins/Pop_Youth_Black)
Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Initial data exploration | R | Reveals many null values for American Indian, Asian, and Native Hawaiian/Pacific Islander records (more than half are null)
Delete sparse variables | Excel | Remove columns of data for Asian, American Indian, and Native Hawaiian/Pacific Islander variables.
Merge with Medicaid Dataset to create Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join)
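The exploration and sparse-variable steps above were done in R and Excel. A minimal Python sketch of the same idea, measuring the share of nulls per column and dropping columns that are more than half null, might look like this. The 0.5 threshold mirrors the "more than half are null" observation; the column name `YouthAsianIns_Rate`, the sample values, and the function names are assumptions made for illustration.

```python
def null_share(rows, column):
    """Fraction of rows where the column is missing (None or empty string)."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows) if rows else 0.0


def drop_sparse_columns(rows, threshold=0.5):
    """Return (rows without sparse columns, sorted list of dropped column names)."""
    columns = sorted({name for row in rows for name in row})
    sparse = [c for c in columns if null_share(rows, c) > threshold]
    kept = [{k: v for k, v in row.items() if k not in sparse} for row in rows]
    return kept, sparse


# Hypothetical sample: the Asian rate is missing in 3 of 4 county-year rows.
sample = [
    {"ID-Year": "01001-2014", "YouthBlackIns_Rate": 0.91, "YouthAsianIns_Rate": None},
    {"ID-Year": "01003-2014", "YouthBlackIns_Rate": 0.88, "YouthAsianIns_Rate": None},
    {"ID-Year": "01005-2014", "YouthBlackIns_Rate": None, "YouthAsianIns_Rate": None},
    {"ID-Year": "01007-2014", "YouthBlackIns_Rate": 0.93, "YouthAsianIns_Rate": 0.97},
]
cleaned, dropped = drop_sparse_columns(sample)
print(dropped)
```

Reporting the dropped names alongside the cleaned rows keeps a record of which variables were excluded and why.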
Income Data (Table S1903)
County-level annual data (5-year running average) for all 50 U.S. States from 2012-2020
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | Retain only County, FIPS ID, State, and Median Household Income Variable
Rename Variable (Columns) | Excel | From: “Estimate!!Number!!HOUSEHOLD INCOME BY RACE AND HISPANIC OR LATINO ORIGIN OF HOUSEHOLDER!!Households” to Median Income
Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join)
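A full join on the ID-Year key keeps every county-year row from both tables and leaves gaps where one side has no match. The merge itself was done in Excel; the following is only a sketch of the join semantics in plain Python, with a hypothetical `full_join` helper and made-up sample values.

```python
def full_join(left, right, key="ID-Year"):
    """Full outer join of two lists of dict rows on a shared key column.

    Columns missing on one side are filled with None. If both tables share
    a non-key column, the right-hand value wins (an arbitrary choice here).
    """
    all_cols = {c for row in left + right for c in row}
    merged = {}
    for row in left + right:
        # Start each key with every column set to None, then fill in what exists.
        target = merged.setdefault(row[key], dict.fromkeys(all_cols))
        target.update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical rows: one key only on the left, one shared, one only on the right.
combined = [
    {"ID-Year": "01001-2014", "MaleYouthMed_Rate": 0.30},
    {"ID-Year": "01003-2014", "MaleYouthMed_Rate": 0.25},
]
income = [
    {"ID-Year": "01003-2014", "Median Income": 52000},
    {"ID-Year": "01005-2014", "Median Income": 34000},
]
result = full_join(combined, income)
for row in result:
    print(row["ID-Year"], row["MaleYouthMed_Rate"], row["Median Income"])
```

Because a full join never discards rows, each additional source can introduce new nulls, which is worth checking after every merge.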
Employment & Education Data (Table S2301)
County-level annual data (5-year running average) for all 50 U.S. States from 2012-2020
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | Retain only County, FIPS ID, State; Population 20-64 years, Labor Force Participation Rate, Unemployment Rate, Poverty Status; Population 25-64, Educational Attainment-Less than HS, HS/Equivalent, Some College/Associates, Bachelors or higher
Rename Variable (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from “Estimate!!Labor Force Participation Rate!!Population 20 to 64 years” to LaborForceRate)
Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join)
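Dropping unneeded columns and renaming the rest can be expressed as a single mapping from long source names to short names. The sketch below is illustrative only: the labor-force name pair comes from the example above, while the margin-of-error column name, the sample values, and the `select_and_rename` helper are hypothetical.

```python
# Identifier columns that are kept under their existing names.
KEEP_AS_IS = ("County", "FIPS", "State")

# Long source name -> simplified name. Only the first pair appears in the text above.
RENAME = {
    "Estimate!!Labor Force Participation Rate!!Population 20 to 64 years": "LaborForceRate",
}


def select_and_rename(row, rename=RENAME, keep=KEEP_AS_IS):
    """Keep identifier columns, rename mapped columns, and drop everything else."""
    out = {name: row[name] for name in keep if name in row}
    for old, new in rename.items():
        if old in row:
            out[new] = row[old]
    return out


# Hypothetical raw row with one estimate column and one margin-of-error column.
raw = {
    "County": "Sample County",
    "FIPS": "01001",
    "State": "State A",
    "Estimate!!Labor Force Participation Rate!!Population 20 to 64 years": 74.2,
    "Margin of Error!!Labor Force Participation Rate!!Population 20 to 64 years": 1.3,
}
tidy = select_and_rename(raw)
print(tidy)
```

Keeping the mapping in one place makes it easy to apply the same selection to every annual file before the years are merged.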
County Health Rankings
County-level annual data (5-year running average) for all 50 U.S. States from 2012-2020
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | Full dataset contains 786 variables. Retain only County, FIPS ID, State; Life Lost Rate, Adult Reported Fair/Poor Health (%), Avg Poor Physical Health Days/Month; Avg Poor Mental Health Days/Month, % Low Birth Weight (live births), Adult Obesity Rate, Sexually Transmitted Infections (STIs) per 100K, Teen Birth Rate, Primary Care Physicians per 100K, Preventable Hospitalizations per 100K Medicare Enrollees, Percent Child Poverty, Percent Single Parent Households, Violent Crime Rate, Percent Smokers, Percent Excess Alcohol Consumption, Flu Vaccination Percent
Rename Variable (Columns) | Excel | Adjust Variable Names, e.g. from “% Vaccinated” to “Flu Vaccination Percent”
Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join)
Centers for Disease Control
Diabetes Atlas: Diagnosed Diabetes Among Adults 20+ Years, Age-Adjusted Percentage
County-level annual data (5-year running average) for all 50 U.S. States from 2012-2020*
*2020 data taken from County Health Rankings (not available via CDC Diabetes Atlas)
Step | Tool | Description
Drop Unneeded Variables (Columns) | Excel/ | Drop State FIPS
Rename Variable (Columns) | Excel | Change “Percentage” to “Diabetes Prevalence”
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets)
Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join)
COVID Data - USA Facts
County-level daily data from 1/1/2019 to 6/9/22 for all 50 U.S. States
Step | Tool | Description
Create New Data Table | Excel | Summary Data Table Shell for annual data
Add only year-end values | Excel | Copy data for county, state, FIPS, and last day of each year (or last available day) into a column for cases or deaths by the year
Rename Variable (Columns) | Excel | Data combined from 3 sheets (deaths, cases, population), with columns renamed with Death or Cases, plus year (2020, 2021, 2022 from “12/31/2020”, “12/31/2021”, “6/09/22”)
Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population (e.g., 2020 Cases = 2020 Cases/Pop)
Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data
Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Left Join)
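The year-end extraction and left join above were done by hand in Excel. As a hedged sketch, the same reduction from daily to annual values could be written in plain Python as below. It assumes the daily figures are cumulative counts, uses ISO date strings rather than the source's date format, and treats the Combined Dataset as the left-hand table; the helper names and all sample numbers are hypothetical.

```python
from datetime import date


def year_end_values(daily):
    """Map each year to the count on its last available day.

    `daily` maps ISO date strings to cumulative counts for one county.
    """
    latest = {}
    for day_text, count in daily.items():
        day = date.fromisoformat(day_text)
        # Keep the value only if this day is later than the one stored for the year.
        if day.year not in latest or day > latest[day.year][0]:
            latest[day.year] = (day, count)
    return {year: count for year, (_, count) in sorted(latest.items())}


def left_join(left, right, key="ID-Year"):
    """Keep every left row; attach matching right columns where the key exists."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


# Hypothetical cumulative case counts for one county (FIPS 01001).
cases = {
    "2020-12-30": 980, "2020-12-31": 1000,
    "2021-12-31": 2500,
    "2022-06-08": 2950, "2022-06-09": 3000,
}
population = 50000

year_end = year_end_values(cases)
covid_rows = [
    {"ID-Year": f"01001-{year}", "Cases": count, "Cases_Rate": count / population}
    for year, count in year_end.items()
]

# Rows of the combined table with no COVID match simply gain no new columns.
combined = [{"ID-Year": "01001-2019"}, {"ID-Year": "01001-2020"}]
joined = left_join(combined, covid_rows)
print(year_end)
print(joined)
```

With cumulative inputs, each year-end value is a running total rather than a within-year count, so year-over-year differences would be needed to obtain annual new cases.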
Step 2:
- Use R to make sure all data are in the correct format (e.g. numeric, factor, etc.).
- Use R to generate summary statistics and the structure of the data, and produce histograms, bar charts, and box plots to find:
  - Missing/null values: a significant number of nulls in the data by race (Asian, American Indian, and Native Hawaiian) led to a focus on Black, Latino, and White populations for health insurance data.
  - Skewness
  - Outliers
- Use R to create correlation matrices to identify correlated features, which can be used to reduce the number of features in the model.
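The checks above were run in R. For illustration only, the skewness, outlier, and correlation checks can be sketched with the Python standard library as follows. The 1.5 x IQR outlier rule is a common convention assumed here rather than one stated in the source, and the function names and sample numbers are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def skewness(values):
    """Population skewness: the mean cubed z-score (requires non-constant data)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical feature with one extreme value.
rates = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(skewness(rates) > 0, iqr_outliers(rates))

# Hypothetical columns: two of them carry the same information with opposite sign.
features = {
    "income": [30, 40, 50, 60],
    "insured": [0.80, 0.85, 0.90, 0.95],
    "uninsured": [0.20, 0.15, 0.10, 0.05],
}
matrix = correlation_matrix(features)
print(round(matrix["insured"]["uninsured"], 3))
```

Pairs with an absolute correlation close to 1 are candidates for removal, since keeping both adds little new information to the model.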