Data preparation and cleansing was a multi-step process, requiring cleaning
raw data to prepare it to be merged together, as well as initial exploration to
identify the features to include and any data noise/missing data issues. Below
are the steps I took on each individual set of data, as well as what was needed
once the data were combined.
Step 1:
US Census Bureau: American Community Survey 5-year
Medicaid/Means-Tested Public Coverage by Sex by Age (Table C27007)
County-level annual data (5-year running average) for all 50 U.S. States, from
2012-2020

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Table provided source, margin of error for each variable; these columns were dropped. |
| Rename Variables (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from “Estimate!!Total:!!Male:!!Under 19 years:” to Male Youth Pop) |
| Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population or subpopulation (e.g. MaleYouthMed_Rate = Male Youth Medicaid/Male Youth Pop) |
| Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Add Medicaid Expansion Status | Excel | One-to-Many merge of State/year of expansion to each County/State row |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Remove Duplicates | Excel | Adding Medicaid Expansion Status step created duplicate records for Arkansas and Kansas; these were removed |

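The index, percentage, expansion-status, and de-duplication steps in this table were done in Excel. As a rough illustration of the same logic, here is a minimal Python sketch. It is not the workbook actually used: the sample rows, the expansion lookup, the placeholder state names, and the helper names (`make_key`, `add_rate`, `attach_expansion`, `drop_duplicate_keys`) are all hypothetical, and the ID-Year format (zero-padded 5-digit FIPS, a hyphen, then the year) is an assumption.

```python
# Hypothetical county rows; every value here is made up for illustration.
rows = [
    {"FIPS": "5001", "State": "State A", "Year": 2014,
     "Male Youth Pop": 2000, "Male Youth Medicaid": 900},
    {"FIPS": "20001", "State": "State B", "Year": 2014,
     "Male Youth Pop": 1500, "Male Youth Medicaid": 450},
]

# Hypothetical state-level lookup: (state, first year of expansion or None).
# "State A" is listed twice on purpose, to show how a one-to-many merge
# against a lookup with repeated states produces duplicate county rows.
expansion_lookup = [("State A", 2014), ("State A", 2014), ("State B", None)]


def make_key(fips, year):
    """Build the ID-Year index: zero-padded 5-digit FIPS, hyphen, year."""
    return f"{str(fips).zfill(5)}-{year}"


def add_rate(row, numerator, denominator, name):
    """Add a share column; None when the denominator is zero or missing."""
    den = row.get(denominator)
    row[name] = row[numerator] / den if den else None


def attach_expansion(county_rows, lookup):
    """One-to-many merge of state expansion status onto county rows.

    Rows whose state is absent from the lookup are dropped (inner-join
    behaviour), and repeated lookup entries yield repeated output rows.
    """
    merged = []
    for row in county_rows:
        for state, first_year in lookup:
            if state == row["State"]:
                expanded = first_year is not None and row["Year"] >= first_year
                merged.append({**row, "Expanded": expanded})
    return merged


def drop_duplicate_keys(data, key="ID-Year"):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in data:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


for row in rows:
    row["ID-Year"] = make_key(row["FIPS"], row["Year"])
    add_rate(row, "Male Youth Medicaid", "Male Youth Pop", "MaleYouthMed_Rate")

merged = attach_expansion(rows, expansion_lookup)   # 3 rows: State A doubled
deduped = drop_duplicate_keys(merged)               # back to 2 rows
```

Zero-padding the FIPS code matters because spreadsheets often store it as a number and silently drop the leading zero, which would break later joins on the key.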
Health Insurance by Age and Race (Tables C27001B-E, H)
County-level annual data (5-year running average) for all 50 U.S. States from
2012-2020 for American Indian/Native, Asian, Black, Latino/Hispanic, Native
Hawaiian/Pacific Islander, and White Non-Hispanic sub-populations.
Excluded ‘some other race’ and ‘two or more races’ because visual inspection
suggested too few records.

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Table provided source, margin of error for each variable; these columns were dropped. |
| Rename Variables (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from C27001B “Margin of Error!!Total:!!Under 19 years:” to Pop_Youth_Black) |
| Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population or subpopulation (e.g. YouthBlackIns_Rate = Youth_Black_Ins/Pop_Youth_Black) |
| Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Initial data exploration | R | Reveals many null values for American Indian, Asian, and Native Hawaiian/Pacific Islander records (more than half are null) |
| Delete sparse variables | Excel | Remove columns of data for Asian, American Indian, and Native Hawaiian/Pacific Islander variables. |
| Merge with Medicaid Dataset to create Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join) |

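The exploration and "delete sparse variables" steps above amount to measuring the share of null values per column and dropping columns where more than half are null. The original work used R and Excel; the following is only a minimal Python sketch of that rule. The sample values, the `Pop_Youth_Asian` column name, the 0.5 threshold parameter, and the function names are assumptions made for illustration.

```python
# Hypothetical rows; None or "" stands for a missing (null) cell.
rows = [
    {"ID-Year": "01001-2019", "Pop_Youth_Black": 2500, "Pop_Youth_Asian": None},
    {"ID-Year": "01003-2019", "Pop_Youth_Black": 4100, "Pop_Youth_Asian": 310},
    {"ID-Year": "01005-2019", "Pop_Youth_Black": 2900, "Pop_Youth_Asian": None},
    {"ID-Year": "01007-2019", "Pop_Youth_Black": 1200, "Pop_Youth_Asian": ""},
]


def null_share(data, column):
    """Fraction of rows in which the column is missing or blank."""
    missing = sum(1 for row in data if row.get(column) in (None, ""))
    return missing / len(data)


def drop_sparse_columns(data, threshold=0.5):
    """Drop columns whose null share exceeds the threshold.

    Assumes every row has the same columns as the first row.
    Returns the trimmed rows and the list of dropped column names.
    """
    columns = list(data[0].keys())
    sparse = [c for c in columns if null_share(data, c) > threshold]
    kept = [{k: v for k, v in row.items() if k not in sparse} for row in data]
    return kept, sparse


trimmed, dropped = drop_sparse_columns(rows)
```

With the sample rows, three of four `Pop_Youth_Asian` cells are missing, so that column is dropped while `Pop_Youth_Black` and the key are kept.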
Income Data (Table S1903)
County-level annual data (5-year running average) for all 50 U.S. States from
2012-2020

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Retain only County, FIPS ID, State, and Median Household Income Variable |
| Rename Variable (Columns) | Excel | From: “Estimate!!Number!!HOUSEHOLD INCOME BY RACE AND HISPANIC OR LATINO ORIGIN OF HOUSEHOLDER!!Households” to Median Income |
| Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join) |

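A full join on the ID-Year key keeps every key that appears in either table and leaves the other table's columns empty where there is no match. The merge itself was done in Excel; the sketch below only illustrates the join semantics in Python. The `full_join` helper, the sample rows, and the column values are hypothetical, and the helper assumes the key is unique within each table.

```python
def full_join(left, right, key="ID-Year"):
    """Full outer join of two lists of dict rows on a shared key.

    Assumes each key value occurs at most once per table. Columns that a
    row has no value for are filled with None.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Collect every column name, preserving first-seen order.
    columns = []
    for row in left + right:
        for col in row:
            if col not in columns:
                columns.append(col)

    # Left keys first, then keys that only exist on the right.
    ordered_keys = list(left_by_key)
    ordered_keys += [k for k in right_by_key if k not in left_by_key]

    joined = []
    for k in ordered_keys:
        merged = dict.fromkeys(columns)          # start with all None
        merged.update(left_by_key.get(k, {}))
        merged.update(right_by_key.get(k, {}))
        joined.append(merged)
    return joined


# Hypothetical rows; values are made up for illustration.
combined = [
    {"ID-Year": "01001-2019", "YouthBlackIns_Rate": 0.95},
    {"ID-Year": "01003-2019", "YouthBlackIns_Rate": 0.93},
]
income = [
    {"ID-Year": "01001-2019", "Median Income": 58000},
    {"ID-Year": "01005-2019", "Median Income": 34000},
]

result = full_join(combined, income)
```

Rows that come out with None in one table's columns are a quick way to spot counties or years missing from a source.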
Employment & Education Data (Table S2301)
County-level annual data (5-year running average) for all 50 U.S. States from
2012-2020

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Retain only County, FIPS ID, State; Population 20-64 years, Labor Force Participation Rate, Unemployment Rate, Poverty Status; Population 25-64, Educational Attainment-Less than HS, HS/Equivalent, Some College/Associates, Bachelors or higher |
| Rename Variable (Columns) | Excel | Original variable names were long and cumbersome; they were simplified (e.g. from “Estimate!!Labor Force Participation Rate!!Population 20 to 64 years” to LaborForceRate) |
| Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join) |

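The "Add Year", "Create Index", and "Merge Annual Data Sets" steps together stack one table per year into a single long table keyed by ID-Year. As a hedged illustration of that union (the actual work was done in Excel), the Python sketch below stacks hypothetical annual tables and checks that the resulting key is unique. The sample values and the `stack_years` and `duplicate_keys` helpers are assumptions.

```python
from collections import Counter

# Hypothetical annual extracts, one list of rows per year; values are made up.
annual_tables = {
    2019: [{"FIPS": "01001", "LaborForceRate": 75.1},
           {"FIPS": "01003", "LaborForceRate": 73.4}],
    2020: [{"FIPS": "01001", "LaborForceRate": 74.6},
           {"FIPS": "01003", "LaborForceRate": 72.9}],
}


def stack_years(tables):
    """Union of annual tables: add Year and ID-Year, then append all rows."""
    stacked = []
    for year in sorted(tables):
        for row in tables[year]:
            new_row = dict(row)                       # leave the input untouched
            new_row["Year"] = year
            new_row["ID-Year"] = f"{row['FIPS']}-{year}"
            stacked.append(new_row)
    return stacked


def duplicate_keys(data, key="ID-Year"):
    """Return key values that occur more than once (should be empty)."""
    counts = Counter(row[key] for row in data)
    return [k for k, n in counts.items() if n > 1]


stacked = stack_years(annual_tables)
dupes = duplicate_keys(stacked)
```

Checking for repeated keys right after the union catches problems before the later joins multiply them.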
County Health Rankings
County-level annual data (5-year running average) for all 50 U.S. States from
2012-2020

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Full dataset contains 786 variables. Retain only County, FIPS ID, State; Life Lost Rate, Adult Reported Fair/Poor Health (%), Avg Poor Physical Health Days/Month; Avg Poor Mental Health Days/Month, % Low Birth Weight (live births), Adult Obesity Rate, Sexually Transmitted Infections (STIs) per 100K, Teen Birth Rate, Primary Care Physicians per 100K, Preventable Hospitalizations per 100K Medicare Enrollees, Percent Child Poverty, Percent Single Parent Households, Violent Crime Rate, Percent Smokers, Percent Excess Alcohol Consumption, Flu Vaccination Percent |
| Rename Variable (Columns) | Excel | Adjust Variable Names, e.g. from “% Vaccinated” to “Flu Vaccination Percent” |
| Add Year | Excel | Added column to capture year of data (to facilitate merging multiple years into one table) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join) |

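With 786 source variables, dropping and renaming columns by hand is error-prone, so one option is to drive both steps from a single mapping of source name to final name. The Python sketch below illustrates that idea only; the work described here was done in Excel. Apart from “% Vaccinated” and “Flu Vaccination Percent”, the source column names, the sample values, and the `select_and_rename` helper are hypothetical.

```python
# Mapping of source column name -> name to keep it under.
# Only the "% Vaccinated" rename comes from the table above; the other
# source names are assumed for illustration.
rename_map = {
    "FIPS": "FIPS",
    "County": "County",
    "State": "State",
    "% Vaccinated": "Flu Vaccination Percent",
}

# Hypothetical source row with one column that should be discarded.
source_rows = [
    {"FIPS": "01001", "County": "County X", "State": "State A",
     "% Vaccinated": 41, "Some Other Measure": 12.3},
]


def select_and_rename(data, mapping):
    """Keep only the mapped columns and rename them in one pass.

    Raises KeyError if an expected source column is missing, so that a
    changed column name in a new year's file is noticed immediately.
    """
    if data:
        missing = [col for col in mapping if col not in data[0]]
        if missing:
            raise KeyError(f"missing source columns: {missing}")
    return [{new: row[old] for old, new in mapping.items()} for row in data]


cleaned = select_and_rename(source_rows, rename_map)
```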
Centers for Disease Control
Diabetes Atlas: Diagnosed Diabetes Among Adults 20+ Years, Age-Adjusted
Percentage
County-level annual data (5-year running average) for all 50 U.S. States from
2012-2020*
*2020 data taken from County Health Rankings (not available via CDC
Diabetes Atlas)

| Step | Tool | Description |
| --- | --- | --- |
| Drop Unneeded Variables (Columns) | Excel/ | Drop State FIPS |
| Rename Variable (Columns) | Excel | Change “Percentage” to “Diabetes Prevalence” |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge Annual Data Sets | Excel | Merge all years using index (ID-Year) as the primary key (Union of Data Sets) |
| Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Full Join) |

COVID Data: USA Facts
County-level daily data from 1/1/2019 to 6/9/22 for all 50 U.S. States

| Step | Tool | Description |
| --- | --- | --- |
| Create New Data Table | Excel | Summary Data Table Shell for annual data |
| Add only year-end values | Excel | Copy data for county, state, FIPS, and last day of each year (or last available day) into a column for cases or deaths by the year |
| Rename Variable (Columns) | Excel | Data combined from 3 sheets (deaths, cases, population), with columns renamed with Death or Cases, plus year (2020, 2021, 2022 from “12/31/2020”, “12/31/2021”, “6/09/22”) |
| Create Percentage Variables (Columns) | Excel | Create new columns that convert raw population numbers into percentage shares of population (e.g., 2020 Cases = 2020 Cases/Pop) |
| Create Index | Excel | Merge Year with 5-digit County FIPS code to create a unique identifier for each row of data |
| Merge with Combined Dataset | Excel | Merge all rows using index (ID-Year) as the primary key (Left Join) |

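The COVID steps reduce daily cumulative counts to one value per year (the last day of the year, or the last available day), convert that to a share of population, and attach it to the combined dataset with a left join. The sketch below illustrates that sequence in Python under several assumptions: the counts and population are made up, the data are held in long form (one row per county-year) rather than the wide Excel layout, and all function names are hypothetical.

```python
from datetime import date

# Hypothetical cumulative case counts for one county; values are made up.
daily_cases = {
    "01001": {
        date(2020, 12, 30): 4100,
        date(2020, 12, 31): 4190,
        date(2021, 12, 31): 11018,
        date(2022, 6, 8): 15900,
        date(2022, 6, 9): 15910,
    }
}
population = {"01001": 55869}  # hypothetical county population


def year_end_values(series):
    """Return {year: value on the last available day of that year}."""
    latest = {}
    for day in sorted(series):
        latest[day.year] = series[day]   # later days overwrite earlier ones
    return latest


def build_covid_rows(cases, pop):
    """One row per county-year with the count and its share of population."""
    out = []
    for fips, series in cases.items():
        for year, count in year_end_values(series).items():
            out.append({"ID-Year": f"{fips}-{year}",
                        "Cases": count,
                        "Cases_Rate": count / pop[fips]})
    return out


def left_join(left, right, key="ID-Year"):
    """Keep every left row; add right columns, None where no key matches."""
    right_by_key = {row[key]: row for row in right}
    right_cols = []
    for row in right:
        for col in row:
            if col != key and col not in right_cols:
                right_cols.append(col)
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


# Hypothetical combined dataset covering 2019 and 2020 only.
combined = [
    {"ID-Year": "01001-2019", "Median Income": 58000},
    {"ID-Year": "01001-2020", "Median Income": 59000},
]
covid_rows = build_covid_rows(daily_cases, population)
joined = left_join(combined, covid_rows)
```

A left join suits this step because the COVID data cover years (2021, 2022) that have no rows in the combined dataset; those years are simply not carried over, while pre-2020 rows keep empty COVID columns.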
Step 2:
Use R to make sure all data are in the correct format (e.g. numeric, factor, etc.).
Use R to generate summary statistics and structure of data, and produce
histograms/bar charts/box plots to find:
- Missing/Null Values; significant number of nulls in data by race (Asian,
  American Indian, and Native Hawaiian) led to focus on Black, Latino,
  and White populations for health insurance data.
- Skewness
- Outliers
Use R to create correlation matrices to identify correlations, which can be used
to reduce the number of features in the model.
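Step 2 was carried out in R. To make the checks concrete, the following Python sketch (standard library only) shows one way to compute skewness, flag outliers, and build a correlation matrix. It is an illustration under stated assumptions, not the analysis actually run: the 1.5 x IQR outlier rule, the 0.9 correlation threshold, the moment-based skewness formula, and all function names are choices made for this example, and every column is assumed to be numeric, complete, and non-constant.

```python
import statistics


def skewness(values):
    """Population (moment) skewness; positive means a longer right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations: matrix[a][b]."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns}
            for a in columns}


def redundant_pairs(matrix, threshold=0.9):
    """Pairs of features whose absolute correlation meets the threshold."""
    names = list(matrix)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(matrix[a][b]) >= threshold]


# Hypothetical feature columns; values are made up for illustration.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],      # perfectly correlated with "a"
    "c": [5, 1, 4, 2],
}
matrix = correlation_matrix(columns)
pairs = redundant_pairs(matrix)
```

Feature pairs returned by `redundant_pairs` are candidates for dropping one of the two, since strongly correlated features carry largely the same information.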