Need help with Project Portfolio

peddnadepradeep
Systemsreport.pdf

CSE 578: Data Visualization

Systems Documentation Report Members of Team 44: Pradeep Peddnade, Jieqiong Zhou, Tian Liang, Sukhwan Yun

1. Roles and responsibilities Product owners: XYZ corporation

Stakeholders: UVW College

Data analysis team members:

• Pradeep Peddnade: exploratory analysis for native-country, race, education and work class of the dataset, machining learning model training and testing of these variables.

• Jieqiong Zhou: Progress report, exploratory analysis for sex, marital-status of the dataset.

• Tian Liang: Systems documentation report; exploratory analysis for occupation, capital- loss, weight and working hours per week of the dataset; insight analysis for 2 variables

• Sukhwan Yun: Executive report, data exploration and data analysis of age, education– num, capital-gain and relationship of the data set.

2. Team goals and a business objective Our understanding of the project is to assist UVW College in their effort in boosting enrollment. They

believe they should target individuals based on their annual income. They drew a line at 50k and would

like us to classify individuals into two categories: annual salary above and below 50k.

We are going to use US census bureau data to establish correlation between annual income and the

other status and data of an individual, such as capital gain, capital loss, education, work class, marital

status, etc. We will start with an exploratory analysis to determine which parameters are important and

which ones are irrelevant. Then we will select the most relevant data for in-depth visualization and

machine learning. Eventually, we will be able to predict an individual’s annual salary based on this

person’s other status and data.

3. Assumptions UVW College assumes people within a certain salary range are more likely to enroll in their degree

program. Therefore, they need to know if a person’s annual salary is above or below $50,000.

UVW College assumes the US census data can be used to indicate the likelihood of a person’s annual

income based on other status and data such as age, gender, education status, marital status,

occupation, etc.

It is assumed that the data from the United States Census Bureau is accurate. The data used for this

study is representative of the individuals to be included in this data analysis.

4.User Stories User Story #1: To increase the enrollment number, a staff member of UVW marketing team would like

to know the relationship between occupation and income.

User Story #2: An associate in UVW marketing group would like to get an understanding of capital loss of

people in the data.

User Story #3: A marketing analyst suggested that work hours per week could be a factor affecting the

income of people and would like to have data to back this hypothesis

User Story #4: The director of marketing would like to know if final weight has anything to do with

income of people interviewed in the census data.

User Story #5: A senior staff member in marketing department is interested in how education-num is

correlated to income.

User Story #6: Marketing group just had a meeting on how to increase enrollment number. One of the

action items is to understand if marital status of an individual is related to this person’s salary.

User Story #7: An intern in marketing group suggested to study the relationship between capital gain

and income of individuals in this data.

User Story #8: The director of marketing asked the team members to analyze the relationship between

work class and annual salary of an individual.

5. Visualizations

Figure 1. Percentage of people with a salary > 50K for each occupation.

To visualize occupation data, we choose bar chart since it is very good for visualize categorical data.

Figure 1 shows the percentage of people with a salary > 50K for each occupation. It shows that certain

occupations such as the executive and managerial positions, professional specialty, protective services,

tech support have a higher percentage (>30%) of individuals with an annual salary more than $50,000.

While certain occupations such as private house services, other services, handlers-cleaners have a lower

percentage (<10%) of individuals with an annual salary higher than $50,000.

Figure 2. Education number of individuals with salary above (left) and below (right) $ 50,000.

To visualize education number data, we choose box chart since education number data are widespread

and box plot is very good at visualize this type of data. Figure 2 shows the education number of

individuals with salary above and below $ 50,000. In the group of individuals with annual salary above

$50,000, the median number of educations is higher than that of the group with annual salary below

$50,000. In addition, the top quartile of the people in the above $50,000 group is higher than that of the

people in the below $50,000 group. These two huge differences show that individuals with more

education are likely to be in the group with an annual salary above $50,000.

Figure 3. Percentage of individuals with annual salary above $ 50,000.

To visualize capital gain data, we choose scatter plot since it is suitable for showing relationship between

continuous data points. Figure 3 shows the percentage of individuals with annual salary above $ 50,000.

We processed the data in the following way to make the graph clearer. Each scatter in Figure 3

represents a group of individuals within a capital gain range of $2,000. For example, the scatter on the

most left side of the figure represents individuals with a capital gain between $0 to $2,000.

Based on Figure 3, we can see that for individuals with capital gain below $10,000, the percentage of

individuals with an annual salary above $50,000 is increased with the increase of capital gain. This

means there is a correlation between the salary and capital gain in this data range.

However, for individuals with a capital gain above $10,000, the correlation between salary and capital

gain is very weak, if there is any. We can see data points jump between 0 to 100% and there is no

pattern or correlation can be found. This is due to the scarcity of capital gain data points in the range

above $10,000. As we can see from Figure 4, Histogram of capital-gain, the majority of capital gain data

points are below $10,000. The data points above $10,000 are very scarce and have no statistical

significance.

In summary, annual salary is positively correlated to capital gain for individuals with capital gain less

than $10,000. For individuals with capital gain above $10,000, it is hard to draw a conclusion due to lack

of statistical significance of the data.

Figure 4. Histogram of capital-gain.

Figure 5. Histogram of capital-loss.

To explore capital loss data, we choose histogram to get some idea about if there is a correlation

between it and salary. Figure 5 shows the histogram of capital-loss. The vast majority of data points fall

in the very first bin on the left side of the figure. This indicates that there is no statistical correlation

between the capital-loss and annual salary.

Figure 6. Percentage of individuals with annual salary above $ 50,000 as a function of work hours-per-

week.

To visualize hours per week data, we choose scatter plot since it gives reader a very good idea about

continuous correlations. Figure 6 shows the percentage of individuals with annual salary above $50,000

as a function of work hours-per-week. We processed the data in the same way as we did for Figure 3 for

clarification purpose. Each scatter in Figure 6 represents a group of people work within a certain range

of hours per week. For examples, the scatter on the most left side of the graph represents people work

between 20 to25 hours per week.

We can see a very week correlation between salary and hour-per-week. Generally, individuals work less

than 40 hours per week are less likely to earn more than $50,000. While people work more than 40

hours a week are much more likely to make more than $50,000 annually.

Figure 7. Percentage of individuals with annual salary above $ 50,000 as a function of final weight.

We choose scatter plot to show the salary data as a function of final weight since final weight is a

continuous data. Figure 7 shows the percentage of individuals with annual salary above $ 50,000 as a

function of final weight. We process the data in the same as we did for Figure 6. Based on the scatter

plot, we can see there is no correlation between the salary and final weight.

Figure 8. Percentage of individuals in each work class with annual salary above $50,000 (top) and below

$50,000 (bottom).

To visualize work class data, we choose pie chart to see different classed in each salary group since pie

chart is very good at showing the composition of a data set. Figure 8 shows the percentage of individuals

in each work class with annual salary above $50,000 (top) and below $50,000 (bottom). It is very clear

that individuals work in the private sector are less likely to make more than $50,000 a year. While self

employed with an Inc are more likely to make more than $50,000 a year.

6. Questions Question 1: There is a total of 14 parameters. Some of them are relevant to the annual salary of an

individual, while some of them are not. We need to determine which parameters to use for in-depth

analysis and machine learning.

Solution to question 1: We started with a thorough discussion during one of our team meetings at the

beginning of this project. After the discussion, we decided to start with an exploratory analysis of each

parameter. After the initial analysis, we picked 4~8 parameters which are the most relevant to an

individual’s annual salary for the next step.

Question 2: Which machine learning method we should use for this study.

Solution to question 2: There are a total of 14 parameters, with some being numerical and the others

being categorical data. The outcome is either above or below 50k. Typically for this type of outcome,

logistic regression is an ideal method for machine learning analysis. Therefore, we choose logistic

regression and found how certain parameters are related to people’s annual salary.

7. Not doing In the machine learning analysis, we choose logistic regression model. This model is good enough to

provide information regarding the relationship between the parameters we choose, and individual’s

annual salary. In the future, more models can be included in the analysis. Several other metrics could be

used to compare different models and their accuracy.

In this study, we didn’t include parameters such as age, education, relationship, race, sex and native

country. These parameters can be included in further studies in the future.

8. Appendix Code:

import pandas as pd

import numpy as np

from collections import Counter

import matplotlib.pyplot as plt

import numpy

from statsmodels.graphics.mosaicplot import mosaic

from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score,

f1_score

from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, trai

n_test_split

import warnings

%matplotlib inline

df = pd.read_csv("/content/adult.data", header=None, sep=", ")

df.columns = ["age", "workclass", "fnlwgt", "education", "education-

num", "marital-

status", "occupation", "relationship", "race", "sex", "capital-

gain", "capital-loss", "hours-per-week", "native-country", "class"]

df = df[df["workclass"] != '?']

df = df[df["education"] != '?']

df = df[df["marital-status"] != '?']

df = df[df["occupation"] != '?']

df = df[df["relationship"] != '?']

df = df[df["race"] != '?']

df = df[df["sex"] != '?']

df = df[df["native-country"] != '?']

below = df[df["class"] == "<=50K"]

above = df[df["class"] == ">50K"]

figg = plt.figure()

axx = figg.gca()

below.boxplot(column='education-num', ax =axx)

axx.set_title("Boxplot with a salary <= 50K for each education-num")

plt.show()

above_50k = Counter(above['workclass'])

below_50k = Counter(below['workclass'])

print('workclass')

fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5,10))

axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%'

)

axes[0].set_title(">50K")

axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%'

)

axes[1].set_title("<=50K")

plt.show()

fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15,10))

fig.subplots_adjust(hspace=.5)

mosaic(df, ['marital-status', 'class'], ax=axes, axes_label=False)

plt.show()

occupation_list = df.groupby('occupation')

occupations = occupation_list.groups.keys()

occupation_salary = []

for occupation in occupations:

occupation_member = df[df['occupation'] == occupation]

above_total = sum(occupation_member['salary'] == ' >50K')

below_total = sum(occupation_member['salary'] == ' <=50K')

occupation_salary.append([occupation, 100 * above_total/(below_total +

above_total)])

occupation_salary_df = pd.DataFrame(occupation_salary, columns = ['occupat

ion', 'per of >50K'])

plt.barh(occupation_salary_df['occupation'],occupation_salary_df['per of >

50K'])

plt.ylabel('Occupation')

plt.xlabel('Percentage of people with a salary > 50K (%)')

plt.title('Percentage of people with a salary > 50K for each occupation ',

fontdict = {'fontsize' : 20})

plt.show()

train = df

train = train.drop("capital-loss", axis=1)

train = train.drop("native-country", axis=1)

train = train.drop("fnlwgt", axis=1)

train = train.drop("education",axis=1)

def get_occupation(x):

if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:

return 1

elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-

repair"]:

return 2

else:

return 3

def get_relationship(x):

if x == "Own-child":

return 6

elif x == "Other-relative":

return 5

elif x == "Unmarried":

return 4

elif x == "Not-in-family":

return 3

elif x == "Husband":

return 2

else:

return 1

def get_race(x):

if x == "Other":

return 5

elif x == "Amer-Indian-Eskimo":

return 4

elif x == "Black":

return 3

elif x == "White":

return 2

else:

return 1

def get_sex(x):

if x == "Male":

return 2

else:

return 1

def get_class(x):

if x == ">50K":

return 1

else:

return 0

def get_workclass(x):

if x == "Without-pay":

return 7

elif x == "Private":

return 6

elif x == "State-gov":

return 5

elif x == "Self-emp-not-inc":

return 4

elif x == "Local-gov":

return 3

elif x == "Federal-gov":

return 2

else:

return 1

def get_marital_status(x):

if x == "Never-married":

return 7

elif x == "Separated":

return 6

elif x == "Married-spouse-absent":

return 5

elif x == "Widowed":

return 4

elif x == "Divorced":

return 3

elif x == "Married-civ-spouse":

return 2

else:

return 1

train['workclass'] = train['workclass'].apply(get_workclass)

train['marital-status'] = train['marital-

status'].apply(get_marital_status)

train['occupation'] = train['occupation'].apply(get_occupation)

train['relationship'] = train['relationship'].apply(get_relationship)

train['race'] = train['race'].apply(get_race)

train['sex'] = train['sex'].apply(get_sex)

train['class'] = train['class'].apply(get_class)

test=pd.read_csv("/content/adult.data", header=None, sep=", ")

feature = train.iloc[:, :-1]

labels = train.iloc[:, -1]

feature_matrix1 = feature.values

labels1 = labels.values

train_data, test_data, train_labels, test_labels = train_test_split(featur

e_matrix1, labels1, test_size=0.2, random_state=42)

transformed_train_data = MinMaxScaler().fit_transform(train_data)

transformed_test_data = MinMaxScaler().fit_transform(test_data)

mod=LogisticRegression().fit(transformed_train_data,train_labels)

test_predict=mod.predict(transformed_test_data)

acc=accuracy_score(test_labels, test_predict)

f1=f1_score(test_labels, test_predict)

prec=precision_score(test_labels,test_predict)

rec=recall_score(test_labels, test_predict)

print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, 'Logistic Regres

sion'))

factorC = 2000

df['capitalGainBin'] = df['capital-gain'] / factorC

df['capitalGainBin'] = df['capitalGainBin'].apply(np.ceil)

df['capitalGainBin'] = df['capitalGainBin'] * factorC

capitalGainBin_list = df.groupby('capitalGainBin')

capitalGainBins = capitalGainBin_list.groups.keys()

capitalGainBin_salary = []

for capitalGainBin in capitalGainBins:

capitalGainBin_member = df[df['capitalGainBin'] == capitalGainBin]

above_total = sum(capitalGainBin_member['salary'] == ' >50K')

below_total = sum(capitalGainBin_member['salary'] == ' <=50K')

capitalGainBin_salary.append([capitalGainBin, 100 * above_total/(below_t

otal + above_total)])

capitalGainBin_salary_df = pd.DataFrame(capitalGainBin_salary, columns = [

'capital-gain', 'per of >50K'])

plt.scatter(capitalGainBin_salary_df['capital-

gain'],capitalGainBin_salary_df['per of >50K'])

plt.xlabel('Capital-gain')

plt.ylabel('Percentage of people with a salary > 50K (%)')

plt.title('Percentage of people with a salary > 50K for different capitl g

ains', fontdict = {'fontsize' : 20})

plt.show()

plt.hist(df['capital-loss'])

plt.xlabel('Capital-loss')

plt.ylabel('Count')

plt.title('Distribution of capital-loss', fontdict = {'fontsize' : 20})

plt.show()

factorA = 100000

df['wgtBin'] = df['fnlwgt']/factorA

df['wgtBin'] = df['wgtBin'].apply(np.ceil)

df['wgtBin'] = df['wgtBin']*factorA

plt.hist(df['wgtBin'])

plt.xlabel('Final weight')

plt.ylabel('Count')

plt.title('Distribution of final weight', fontdict = {'fontsize' : 20})

plt.show()

wgtBin_list = df.groupby('wgtBin')

wgtBins = wgtBin_list.groups.keys()

wgtBins_salary = []

for wgtBin in wgtBins:

wgtBin_member = df[df['wgtBin'] == wgtBin]

above_total = sum(wgtBin_member['salary'] == ' >50K')

below_total = sum(wgtBin_member['salary'] == ' <=50K')

wgtBins_salary.append([wgtBin, 100 * above_total/(below_total + above_

total)])

wgtBins_salary_df = pd.DataFrame(wgtBins_salary, columns = ['fnlwgt', 'per

of >50K'])

plt.scatter(wgtBins_salary_df['fnlwgt'],wgtBins_salary_df['per of >50K'])

plt.xlabel('Final weight')

plt.ylabel('Percentage of people with a salary > 50K (%)')

plt.title('Percentage of people with a salary > 50K for different final we

ight', fontdict = {'fontsize' : 20})

plt.show()

plt.hist(df['hours-per-week'])

plt.xlabel('Hours-per-week')

plt.ylabel('Count')

plt.title('Distribution of hours-per-week', fontdict = {'fontsize' : 20})

plt.show()

factorB = 10

df['hours_per_weekBin'] = df['hours-per-week'] / factorB

df['hours_per_weekBin'] = df['hours_per_weekBin'].apply(np.ceil)

df['hours_per_weekBin'] = df['hours_per_weekBin'] * factorB

hours_per_weekBin_list = df.groupby('hours_per_weekBin')

hours_per_weekBins = hours_per_weekBin_list.groups.keys()

hours_per_weekBin_salary = []

for hours_per_weekBin in hours_per_weekBins:

hours_per_weekBin_member = df[df['hours_per_weekBin'] == hours_per_wee

kBin]

above_total = sum(hours_per_weekBin_member['salary'] == ' >50K')

below_total = sum(hours_per_weekBin_member['salary'] == ' <=50K')

hours_per_weekBin_salary.append([hours_per_weekBin, 100 * above_total/

(below_total + above_total)])

hours_per_weekBin_salary_df = pd.DataFrame(hours_per_weekBin_salary, colum

ns = ['hours-per-week', 'per of >50K'])

plt.scatter(hours_per_weekBin_salary_df['hours-per-

week'],hours_per_weekBin_salary_df['per of >50K'])

plt.xlabel('hours-per-week')

plt.ylabel('Percentage of people with a salary > 50K (%)')

plt.title('Percentage of people with a salary > 50K for different hours-

per-weeks', fontdict = {'fontsize' : 20})

plt.show()