Need help with Project Portfolio
CSE 578: Data Visualization
Systems Documentation Report Members of Team 44: Pradeep Peddnade, Jieqiong Zhou, Tian Liang, Sukhwan Yun
1. Roles and responsibilities Product owners: XYZ corporation
Stakeholders: UVW College
Data analysis team members:
• Pradeep Peddnade: exploratory analysis for native-country, race, education and work class of the dataset, machining learning model training and testing of these variables.
• Jieqiong Zhou: Progress report, exploratory analysis for sex, marital-status of the dataset.
• Tian Liang: Systems documentation report; exploratory analysis for occupation, capital- loss, weight and working hours per week of the dataset; insight analysis for 2 variables
• Sukhwan Yun: Executive report, data exploration and data analysis of age, education– num, capital-gain and relationship of the data set.
2. Team goals and a business objective Our understanding of the project is to assist UVW College in their effort in boosting enrollment. They
believe they should target individuals based on their annual income. They drew a line at 50k and would
like us to classify individuals into two categories: annual salary above and below 50k.
We are going to use US census bureau data to establish correlation between annual income and the
other status and data of an individual, such as capital gain, capital loss, education, work class, marital
status, etc. We will start with an exploratory analysis to determine which parameters are important and
which ones are irrelevant. Then we will select the most relevant data for in-depth visualization and
machine learning. Eventually, we will be able to predict an individual’s annual salary based on this
person’s other status and data.
3. Assumptions UVW College assumes people within a certain salary range are more likely to enroll in their degree
program. Therefore, they need to know if a person’s annual salary is above or below $50,000.
UVW College assumes the US census data can be used to indicate the likelihood of a person’s annual
income based on other status and data such as age, gender, education status, marital status,
occupation, etc.
It is assumed that the data from the United States Census Bureau is accurate. The data used for this
study is representative of the individuals to be included in this data analysis.
4.User Stories User Story #1: To increase the enrollment number, a staff member of UVW marketing team would like
to know the relationship between occupation and income.
User Story #2: An associate in UVW marketing group would like to get an understanding of capital loss of
people in the data.
User Story #3: A marketing analyst suggested that work hours per week could be a factor affecting the
income of people and would like to have data to back this hypothesis
User Story #4: The director of marketing would like to know if final weight has anything to do with
income of people interviewed in the census data.
User Story #5: A senior staff member in marketing department is interested in how education-num is
correlated to income.
User Story #6: Marketing group just had a meeting on how to increase enrollment number. One of the
action items is to understand if marital status of an individual is related to this person’s salary.
User Story #7: An intern in marketing group suggested to study the relationship between capital gain
and income of individuals in this data.
User Story #8: The director of marketing asked the team members to analyze the relationship between
work class and annual salary of an individual.
5. Visualizations
Figure 1. Percentage of people with a salary > 50K for each occupation.
To visualize occupation data, we choose bar chart since it is very good for visualize categorical data.
Figure 1 shows the percentage of people with a salary > 50K for each occupation. It shows that certain
occupations such as the executive and managerial positions, professional specialty, protective services,
tech support have a higher percentage (>30%) of individuals with an annual salary more than $50,000.
While certain occupations such as private house services, other services, handlers-cleaners have a lower
percentage (<10%) of individuals with an annual salary higher than $50,000.
Figure 2. Education number of individuals with salary above (left) and below (right) $ 50,000.
To visualize education number data, we choose box chart since education number data are widespread
and box plot is very good at visualize this type of data. Figure 2 shows the education number of
individuals with salary above and below $ 50,000. In the group of individuals with annual salary above
$50,000, the median number of educations is higher than that of the group with annual salary below
$50,000. In addition, the top quartile of the people in the above $50,000 group is higher than that of the
people in the below $50,000 group. These two huge differences show that individuals with more
education are likely to be in the group with an annual salary above $50,000.
Figure 3. Percentage of individuals with annual salary above $ 50,000.
To visualize capital gain data, we choose scatter plot since it is suitable for showing relationship between
continuous data points. Figure 3 shows the percentage of individuals with annual salary above $ 50,000.
We processed the data in the following way to make the graph clearer. Each scatter in Figure 3
represents a group of individuals within a capital gain range of $2,000. For example, the scatter on the
most left side of the figure represents individuals with a capital gain between $0 to $2,000.
Based on Figure 3, we can see that for individuals with capital gain below $10,000, the percentage of
individuals with an annual salary above $50,000 is increased with the increase of capital gain. This
means there is a correlation between the salary and capital gain in this data range.
However, for individuals with a capital gain above $10,000, the correlation between salary and capital
gain is very weak, if there is any. We can see data points jump between 0 to 100% and there is no
pattern or correlation can be found. This is due to the scarcity of capital gain data points in the range
above $10,000. As we can see from Figure 4, Histogram of capital-gain, the majority of capital gain data
points are below $10,000. The data points above $10,000 are very scarce and have no statistical
significance.
In summary, annual salary is positively correlated to capital gain for individuals with capital gain less
than $10,000. For individuals with capital gain above $10,000, it is hard to draw a conclusion due to lack
of statistical significance of the data.
Figure 4. Histogram of capital-gain.
Figure 5. Histogram of capital-loss.
To explore capital loss data, we choose histogram to get some idea about if there is a correlation
between it and salary. Figure 5 shows the histogram of capital-loss. The vast majority of data points fall
in the very first bin on the left side of the figure. This indicates that there is no statistical correlation
between the capital-loss and annual salary.
Figure 6. Percentage of individuals with annual salary above $ 50,000 as a function of work hours-per-
week.
To visualize hours per week data, we choose scatter plot since it gives reader a very good idea about
continuous correlations. Figure 6 shows the percentage of individuals with annual salary above $50,000
as a function of work hours-per-week. We processed the data in the same way as we did for Figure 3 for
clarification purpose. Each scatter in Figure 6 represents a group of people work within a certain range
of hours per week. For examples, the scatter on the most left side of the graph represents people work
between 20 to25 hours per week.
We can see a very week correlation between salary and hour-per-week. Generally, individuals work less
than 40 hours per week are less likely to earn more than $50,000. While people work more than 40
hours a week are much more likely to make more than $50,000 annually.
Figure 7. Percentage of individuals with annual salary above $ 50,000 as a function of final weight.
We choose scatter plot to show the salary data as a function of final weight since final weight is a
continuous data. Figure 7 shows the percentage of individuals with annual salary above $ 50,000 as a
function of final weight. We process the data in the same as we did for Figure 6. Based on the scatter
plot, we can see there is no correlation between the salary and final weight.
Figure 8. Percentage of individuals in each work class with annual salary above $50,000 (top) and below
$50,000 (bottom).
To visualize work class data, we choose pie chart to see different classed in each salary group since pie
chart is very good at showing the composition of a data set. Figure 8 shows the percentage of individuals
in each work class with annual salary above $50,000 (top) and below $50,000 (bottom). It is very clear
that individuals work in the private sector are less likely to make more than $50,000 a year. While self
employed with an Inc are more likely to make more than $50,000 a year.
6. Questions Question 1: There is a total of 14 parameters. Some of them are relevant to the annual salary of an
individual, while some of them are not. We need to determine which parameters to use for in-depth
analysis and machine learning.
Solution to question 1: We started with a thorough discussion during one of our team meetings at the
beginning of this project. After the discussion, we decided to start with an exploratory analysis of each
parameter. After the initial analysis, we picked 4~8 parameters which are the most relevant to an
individual’s annual salary for the next step.
Question 2: Which machine learning method we should use for this study.
Solution to question 2: There are a total of 14 parameters, with some being numerical and the others
being categorical data. The outcome is either above or below 50k. Typically for this type of outcome,
logistic regression is an ideal method for machine learning analysis. Therefore, we choose logistic
regression and found how certain parameters are related to people’s annual salary.
7. Not doing In the machine learning analysis, we choose logistic regression model. This model is good enough to
provide information regarding the relationship between the parameters we choose, and individual’s
annual salary. In the future, more models can be included in the analysis. Several other metrics could be
used to compare different models and their accuracy.
In this study, we didn’t include parameters such as age, education, relationship, race, sex and native
country. These parameters can be included in further studies in the future.
8. Appendix Code:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import numpy
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score,
f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, trai
n_test_split
import warnings
%matplotlib inline
df = pd.read_csv("/content/adult.data", header=None, sep=", ")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-
num", "marital-
status", "occupation", "relationship", "race", "sex", "capital-
gain", "capital-loss", "hours-per-week", "native-country", "class"]
df = df[df["workclass"] != '?']
df = df[df["education"] != '?']
df = df[df["marital-status"] != '?']
df = df[df["occupation"] != '?']
df = df[df["relationship"] != '?']
df = df[df["race"] != '?']
df = df[df["sex"] != '?']
df = df[df["native-country"] != '?']
below = df[df["class"] == "<=50K"]
above = df[df["class"] == ">50K"]
figg = plt.figure()
axx = figg.gca()
below.boxplot(column='education-num', ax =axx)
axx.set_title("Boxplot with a salary <= 50K for each education-num")
plt.show()
above_50k = Counter(above['workclass'])
below_50k = Counter(below['workclass'])
print('workclass')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5,10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%'
)
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%'
)
axes[1].set_title("<=50K")
plt.show()
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15,10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ['marital-status', 'class'], ax=axes, axes_label=False)
plt.show()
occupation_list = df.groupby('occupation')
occupations = occupation_list.groups.keys()
occupation_salary = []
for occupation in occupations:
occupation_member = df[df['occupation'] == occupation]
above_total = sum(occupation_member['salary'] == ' >50K')
below_total = sum(occupation_member['salary'] == ' <=50K')
occupation_salary.append([occupation, 100 * above_total/(below_total +
above_total)])
occupation_salary_df = pd.DataFrame(occupation_salary, columns = ['occupat
ion', 'per of >50K'])
plt.barh(occupation_salary_df['occupation'],occupation_salary_df['per of >
50K'])
plt.ylabel('Occupation')
plt.xlabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for each occupation ',
fontdict = {'fontsize' : 20})
plt.show()
train = df
train = train.drop("capital-loss", axis=1)
train = train.drop("native-country", axis=1)
train = train.drop("fnlwgt", axis=1)
train = train.drop("education",axis=1)
def get_occupation(x):
if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:
return 1
elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-
repair"]:
return 2
else:
return 3
def get_relationship(x):
if x == "Own-child":
return 6
elif x == "Other-relative":
return 5
elif x == "Unmarried":
return 4
elif x == "Not-in-family":
return 3
elif x == "Husband":
return 2
else:
return 1
def get_race(x):
if x == "Other":
return 5
elif x == "Amer-Indian-Eskimo":
return 4
elif x == "Black":
return 3
elif x == "White":
return 2
else:
return 1
def get_sex(x):
if x == "Male":
return 2
else:
return 1
def get_class(x):
if x == ">50K":
return 1
else:
return 0
def get_workclass(x):
if x == "Without-pay":
return 7
elif x == "Private":
return 6
elif x == "State-gov":
return 5
elif x == "Self-emp-not-inc":
return 4
elif x == "Local-gov":
return 3
elif x == "Federal-gov":
return 2
else:
return 1
def get_marital_status(x):
if x == "Never-married":
return 7
elif x == "Separated":
return 6
elif x == "Married-spouse-absent":
return 5
elif x == "Widowed":
return 4
elif x == "Divorced":
return 3
elif x == "Married-civ-spouse":
return 2
else:
return 1
train['workclass'] = train['workclass'].apply(get_workclass)
train['marital-status'] = train['marital-
status'].apply(get_marital_status)
train['occupation'] = train['occupation'].apply(get_occupation)
train['relationship'] = train['relationship'].apply(get_relationship)
train['race'] = train['race'].apply(get_race)
train['sex'] = train['sex'].apply(get_sex)
train['class'] = train['class'].apply(get_class)
test=pd.read_csv("/content/adult.data", header=None, sep=", ")
feature = train.iloc[:, :-1]
labels = train.iloc[:, -1]
feature_matrix1 = feature.values
labels1 = labels.values
train_data, test_data, train_labels, test_labels = train_test_split(featur
e_matrix1, labels1, test_size=0.2, random_state=42)
transformed_train_data = MinMaxScaler().fit_transform(train_data)
transformed_test_data = MinMaxScaler().fit_transform(test_data)
mod=LogisticRegression().fit(transformed_train_data,train_labels)
test_predict=mod.predict(transformed_test_data)
acc=accuracy_score(test_labels, test_predict)
f1=f1_score(test_labels, test_predict)
prec=precision_score(test_labels,test_predict)
rec=recall_score(test_labels, test_predict)
print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, 'Logistic Regres
sion'))
factorC = 2000
df['capitalGainBin'] = df['capital-gain'] / factorC
df['capitalGainBin'] = df['capitalGainBin'].apply(np.ceil)
df['capitalGainBin'] = df['capitalGainBin'] * factorC
capitalGainBin_list = df.groupby('capitalGainBin')
capitalGainBins = capitalGainBin_list.groups.keys()
capitalGainBin_salary = []
for capitalGainBin in capitalGainBins:
capitalGainBin_member = df[df['capitalGainBin'] == capitalGainBin]
above_total = sum(capitalGainBin_member['salary'] == ' >50K')
below_total = sum(capitalGainBin_member['salary'] == ' <=50K')
capitalGainBin_salary.append([capitalGainBin, 100 * above_total/(below_t
otal + above_total)])
capitalGainBin_salary_df = pd.DataFrame(capitalGainBin_salary, columns = [
'capital-gain', 'per of >50K'])
plt.scatter(capitalGainBin_salary_df['capital-
gain'],capitalGainBin_salary_df['per of >50K'])
plt.xlabel('Capital-gain')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different capitl g
ains', fontdict = {'fontsize' : 20})
plt.show()
plt.hist(df['capital-loss'])
plt.xlabel('Capital-loss')
plt.ylabel('Count')
plt.title('Distribution of capital-loss', fontdict = {'fontsize' : 20})
plt.show()
factorA = 100000
df['wgtBin'] = df['fnlwgt']/factorA
df['wgtBin'] = df['wgtBin'].apply(np.ceil)
df['wgtBin'] = df['wgtBin']*factorA
plt.hist(df['wgtBin'])
plt.xlabel('Final weight')
plt.ylabel('Count')
plt.title('Distribution of final weight', fontdict = {'fontsize' : 20})
plt.show()
wgtBin_list = df.groupby('wgtBin')
wgtBins = wgtBin_list.groups.keys()
wgtBins_salary = []
for wgtBin in wgtBins:
wgtBin_member = df[df['wgtBin'] == wgtBin]
above_total = sum(wgtBin_member['salary'] == ' >50K')
below_total = sum(wgtBin_member['salary'] == ' <=50K')
wgtBins_salary.append([wgtBin, 100 * above_total/(below_total + above_
total)])
wgtBins_salary_df = pd.DataFrame(wgtBins_salary, columns = ['fnlwgt', 'per
of >50K'])
plt.scatter(wgtBins_salary_df['fnlwgt'],wgtBins_salary_df['per of >50K'])
plt.xlabel('Final weight')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different final we
ight', fontdict = {'fontsize' : 20})
plt.show()
plt.hist(df['hours-per-week'])
plt.xlabel('Hours-per-week')
plt.ylabel('Count')
plt.title('Distribution of hours-per-week', fontdict = {'fontsize' : 20})
plt.show()
factorB = 10
df['hours_per_weekBin'] = df['hours-per-week'] / factorB
df['hours_per_weekBin'] = df['hours_per_weekBin'].apply(np.ceil)
df['hours_per_weekBin'] = df['hours_per_weekBin'] * factorB
hours_per_weekBin_list = df.groupby('hours_per_weekBin')
hours_per_weekBins = hours_per_weekBin_list.groups.keys()
hours_per_weekBin_salary = []
for hours_per_weekBin in hours_per_weekBins:
hours_per_weekBin_member = df[df['hours_per_weekBin'] == hours_per_wee
kBin]
above_total = sum(hours_per_weekBin_member['salary'] == ' >50K')
below_total = sum(hours_per_weekBin_member['salary'] == ' <=50K')
hours_per_weekBin_salary.append([hours_per_weekBin, 100 * above_total/
(below_total + above_total)])
hours_per_weekBin_salary_df = pd.DataFrame(hours_per_weekBin_salary, colum
ns = ['hours-per-week', 'per of >50K'])
plt.scatter(hours_per_weekBin_salary_df['hours-per-
week'],hours_per_weekBin_salary_df['per of >50K'])
plt.xlabel('hours-per-week')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different hours-
per-weeks', fontdict = {'fontsize' : 20})
plt.show()