Ds2

Background.docx

Background

· For this assignment, you will use the Cleansing_Week4.R script and the data.csv file.

· The code is written in the R programming language. Open RStudio, then open the file Cleansing_Week4.R. Follow the steps in the code and answer each of the questions below.

· Some manipulation and rework of the code is required. The steps are explained in detail in the code.

· Complete Step 0 through Step 5, inclusive.

Instructions

You should complete all the steps provided in the code and answer the following questions in a report.

After you complete your readings and watch the provided videos (required), proceed with this implementation and report.

1. Introduction

· Provide information about the Language, GUI, and Data File you are using in this assignment. Use references to support the importance of the language you are using, the advantages, disadvantages, and how it relates to other languages that are used in Data Science.

· Provide the value stored in the variable Randomizer in your code, along with your Student ID, in this section. Take a screenshot of the output in your Console and paste it here.

2. Data Presentation before Cleansing

Run Step 0 and answer the following questions.

A. What is the data file format, and which command did you use to read the data? Does the file have headers?

B. How many observations are there?

C. How many variables are in the data?

D. What is the purpose of the command str(df)? Take a screenshot of the output in your Console and paste it here.

E. What does the command summary(df) report? Find out what it means and answer the question in your paper.

F. Answer the following questions:

a. What types of variables does your file include?

b. What are the specific data types?

c. Are they read properly?

d. Are there any issues?

e. Does your file include both NAs and blanks? How did you identify those?

f. How many NAs do you have?

g. How many blanks do you have?
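Items e through g depend on the difference between a true NA and a blank (an empty string). A minimal sketch of one way to count each is given here, assuming a small hypothetical data frame named toy with illustrative column names (it is not the assignment's data.csv):

```r
# Hypothetical toy data; column names and values are illustrative only.
toy <- data.frame(
  Age    = c(25, NA, 40, 31),
  Gender = c("F", "", "M", NA),
  stringsAsFactors = TRUE
)

# Count true NA values in each column.
na_counts <- colSums(is.na(toy))

# Count blanks (empty strings) in each column. which() drops the NA
# results of the comparison, so NA cells are not counted as blanks.
blank_counts <- sapply(toy, function(x) length(which(as.character(x) == "")))

na_counts
blank_counts
```

In summary() output, blanks in a factor column typically show up as an unnamed level with its own count, while true missing values are listed under NA's.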

3. Data Preprocessing

A. Summarize the preprocessing steps you expect to complete before you run the steps in your code. Recommend methods of imputing NAs in each of the variables, and/or observations, where needed. Review the literature and suggest methods of imputation for categorical and numeric variables.

B. Run Step 1 in your code. How did this step affect the NAs and the blanks in your variables? (You can run summary(df) to determine this.) Take a screenshot of the output in your Console and paste it here.

C. For each of the numeric variables, record the mean and the median; for the categorical variables, record the counts. Present them in a table in your paper.

D. Run Steps 2, 3, and 4. How many observations include NAs? How many variables include NAs? What percentage of rows and of columns have NAs? If we were to eliminate those, what would be the approximate size of the remaining dataset? Is this a proper way to handle missing values?

E. Run Step 5 and answer the following questions:

a. What is the method of imputation that is described? What does linear interpolation mean? Research and discuss whether this is an appropriate method. The above method of imputation has now changed some of the statistics of your variables.

· Run summary(df) and compare the results with the previous statistics. Take a screenshot of the output in your Console and paste it here.

· Do you observe any undesired changes? Explain in detail. How could you have avoided them?

· Are there any NAs remaining in your file?
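Linear interpolation fills a gap by placing the missing value on the straight line between the nearest known values on either side, which presumes that the order of the rows is meaningful. A minimal sketch contrasting it with mean imputation is given here, using a hypothetical vector x and base R's approx(); the values are illustrative only:

```r
# Hypothetical ordered measurements with gaps.
x   <- c(10, NA, 30, NA, NA, 60)
idx <- seq_along(x)

# Mean imputation: every NA receives the same value.
x_mean <- x
x_mean[is.na(x)] <- mean(x, na.rm = TRUE)

# Linear interpolation: each NA is placed on the line joining its
# nearest known neighbours. approx() skips the NA pairs, and xout
# requests a value at every original position.
x_lin <- approx(idx, x, xout = idx)$y

x_mean
x_lin   # 10 20 30 40 50 60

# Mean imputation preserves the mean but shrinks the spread.
sd(x, na.rm = TRUE)
sd(x_mean)
```

When row order carries no meaning, as in many survey-style datasets, interpolating across neighbouring rows has no clear interpretation, which is one point to weigh when judging whether the method is appropriate.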

Length: This assignment must be 4-5 pages (excluding the title and reference pages).

DraftCleansing_Week_41.R

# Do not change the code between this line and line 4
# Install and load necessary libraries
install.packages("ggplot2")  # Install ggplot2 for plotting
install.packages("scales")   # Install scales for formatting
install.packages("moments")  # Install moments for skewness and kurtosis
library(ggplot2)  # Load ggplot2 library
library(scales)   # Load scales library
library(moments)  # Load moments library
############################################################
# The following command returns your current directory in the Console
getwd()
# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() indicated.
# For example, your next line should look similar to this:
# setwd("C:/Users/tsapara/Documents")
setwd("C:/Users/tsapa/OneDrive/Documents")
############################################################
# Read your file here. The file you read is a csv file (comma delimited).
# The first parameter is the file name and should be enclosed in quotes.
# header is the parameter that tells R whether the file has column names.
# sep stands for separator.
# stringsAsFactors ensures that variables that are text in the CSV but are
# categorical (they have specific values that separate the rows into
# categories) are read as factors.
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
############################################################
# In the following part of your code you should set variable
# A with the first 3 digits of your student ID
# B with the last 3 digits of your student ID
# For example if your ID is 348988456399 then you should enter
# A <- 34
# B <- 99
############################################################
A <- 34
B <- 99
Randomizer <- A + B
############################################################
# Do not change the code from here to the line that starts with
# 'random_sample' now contains a random sample of 'sample_size' rows
set.seed(Randomizer)
# Set the number of rows you want in the sample
sample_size <- 500
# Use the sample() function to randomize the rows
random_sample <- df[sample(nrow(df), sample_size, replace = TRUE), ]
df <- random_sample
# 'random_sample' now contains a random sample of 'sample_size' rows from the original dataframe 'df'
############################################################
# Step 0. Now that you have read the file, you want to learn some
# information about your data
############################################################
# The following commands will not be explained here. Do your
# research, review your csv file, and answer the questions related
# to this part of your code.
nrow(df)     # find out what this means and answer the question in your paper
length(df)   # find out what this means and answer the question in your paper
str(df)      # find out what this means and answer the question in your paper
summary(df)  # find out what this means and answer the question in your paper
# Use the last command to identify NAs in your dataframe.
############################################################
# What types of variables does your file include?
# Specific data types?
# Are they read properly?
# Are there any issues?
# Does your file include both NAs and blanks?
# How many NAs do you have, and
# how many blanks?
############################################################
#
# Step 1:
# Handling both blanks and NAs is not simple, so first we want to eliminate
# some of those. Let's eliminate the blanks and change them to NAs.
#
############################################################
# Loop through columns
for (i in 1:length(df)) {
  # Loop through rows
  for (j in 1:nrow(df)) {
    # Check for NA values or empty strings (NA is tested first so the
    # comparison with "" is never evaluated on a missing value)
    if (is.na(df[j, i]) || df[j, i] == "") {
      # Replace with an actual NA value (not the string "NA")
      df[j, i] <- NA
    }
  }
}
# Now run one of the Step 0 commands (lines 56 to 60 of the original script)
# to identify how the number of NAs has changed.
# Record this in your paper.
# summary(df)
# Now you need to drop the blank levels (categories) from your factor variables
df$Gender <- as.factor(as.character(df$Gender))
df$Education <- as.factor(as.character(df$Education))
df$Grade <- as.factor(as.character(df$Grade))
df$MaritalStatus <- as.factor(as.character(df$MaritalStatus))
df$Category <- as.factor(as.character(df$Category))
df$Employment <- as.factor(as.character(df$Employment))
df$Color <- as.factor(as.character(df$Color))
df$Hobby <- as.factor(as.character(df$Hobby))
df$Location <- as.factor(as.character(df$Location))
############################################################
#
# Step 2: Count NAs in the entire dataset
############################################################
total_nas <- sum(is.na(df))
total_nas
# Explain what the printed number is.
############################################################
#
# Step 3: Count rows with NAs
############################################################
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
rows_with_nas
Percent_row_NA <- percent(rows_with_nas / nrow(df))
Percent_row_NA
# How large is the proportion of rows with NAs? We can drop up to 5%,
# but do you think it would be wise to drop the above percent?
# How will this affect your dataset?
############################################################
#
# Step 4: Count columns with NAs
############################################################
cols_with_nas <- sum(colSums(is.na(df)) > 0)
cols_with_nas
Percent_col_NA <- percent(cols_with_nas / length(df))
Percent_col_NA
# How large is the proportion of columns with NAs? We never want to drop
# entire columns, as this would mean that we lose variables and associations,
# but do you think it would be wise to drop the above percent?
# How will this affect your dataset?
############################################################
#
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics, and you will discuss this code.
# For now, you can assume that setting the mean of the variable for numeric
# and the mode for categorical is correct - this is not always the case, of
# course, but the code would become much more complicated in that case.
############################################################
for (col in names(df)) {
  if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
    if (sum(!is.na(df[[col]])) > 10) {
      # If more than 10 non-NA values, use the mean
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    } else {
      # Otherwise, use linear interpolation for imputation.
      # xout requests a value at every row position so the interpolated
      # values line up with the rows that are NA.
      idx <- seq_along(df[[col]])
      df[[col]][is.na(df[[col]])] <-
        approx(idx, df[[col]], xout = idx)[["y"]][is.na(df[[col]])]
    }
  } else if (is.factor(df[[col]])) {
    # The mode is the most frequent level
    mode_val <- names(sort(-table(df[[col]])))[1]
    df[[col]][is.na(df[[col]])] <- mode_val
  } else if (is.character(df[[col]])) {
    df[[col]][is.na(df[[col]])] <- "NA"
  }
}
############################################################
#
# Following the above method to impute has now changed some of the statistics.
#
# Run summary(df) and compare with the previous statistics.
# Do you observe any undesired changes? Explain in detail.
# Are there any more NAs in your file?
############################################################
summary(df)
# complete your Week 5 paper
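The nested loop in Step 1 visits every cell one at a time. A vectorized sketch of the same blank-to-NA conversion, followed by counts in the spirit of Steps 2 to 4, is given here on a hypothetical data frame named toy. It is an illustrative alternative under those assumptions, not a replacement for the assignment script:

```r
# Hypothetical toy data containing both blanks and NAs.
toy <- data.frame(
  Score = c(80, NA, 95, 70, 88),
  Color = c("Red", "", "Blue", NA, "Red"),
  stringsAsFactors = TRUE
)

# Step 1 equivalent: convert blanks to NA one column at a time, then
# drop the now-unused "" level from factor columns.
toy[] <- lapply(toy, function(x) {
  x[!is.na(x) & as.character(x) == ""] <- NA
  if (is.factor(x)) droplevels(x) else x
})

# Steps 2 to 4 equivalents: total NAs, rows with NAs, columns with NAs.
total_nas     <- sum(is.na(toy))
rows_with_nas <- sum(!complete.cases(toy))
cols_with_nas <- sum(colSums(is.na(toy)) > 0)

# Rows that would remain if every incomplete row were dropped.
remaining_rows <- nrow(na.omit(toy))

c(total_nas, rows_with_nas, cols_with_nas, remaining_rows)
```

Comparing remaining_rows with nrow(toy) gives a quick sense of how much data listwise deletion would discard.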
