Data Visualization



Assignment-2

Developing intimacy with your data

Introduction

The dataset used for this report was imported from the Kaggle website, which hosts a repository of datasets available for use under the Creative Commons license. It is a life expectancy dataset with 22 attributes and 2,938 rows (observations). The data explore the factors that play a role in the life expectancy of the sampled countries. The data were pooled from the World Health Organization (WHO) and the United Nations website. The dataset was uploaded by Akhil (2019).

Dataset Summary

Name: led.csv

Type: Comma Separated Values

Data Size: 324 kB

Condition: Raw

Since the data is in raw format, we can use any software that accepts the .csv file extension. We start by visualizing the data in Excel and observing the critical factors.

The red highlights denote all the missing values present in the raw data. This lets us see where the gaps are before making any change to the data.

Figure 1: Visualization from Excel

These empty cells pose a great difficulty when it comes to data analysis. Therefore, we note down the action we expect to take on the blank cells.

In RStudio, we import the dataset using the following command:

On viewing the data, we get

Figure 2: Visualization in R

It is clear that the blank cells in the dataset are interpreted by R as NA. The compact printout of a tibble makes these NA entries easier to spot than in a traditional data frame.
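
The way blank cells turn into NA at import can be sketched as follows. This is a minimal illustration under stated assumptions: the miniature CSV, its column names, and its values are invented, and base R's read.csv is used so the sketch has no package dependencies (the tibble in the report suggests a tidyverse import, which is not shown here).

```r
# Hypothetical miniature of led.csv: column names and values are invented
# for illustration and are not taken from the real file.
csv_text <- "Country,Year,Life.expectancy,Alcohol
A,2014,65.0,0.01
B,2014,,4.50
C,2015,59.9,"

# Write the text to a temporary file so read.csv() has something to read.
# In the report, `path` would simply be "led.csv".
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# Treat empty fields and the literal string "NA" as missing values.
led <- read.csv(path, na.strings = c("", "NA"))

print(led)                    # blank cells are printed as NA
is.na(led$Life.expectancy)    # logical vector marking the missing entries
```

Setting na.strings explicitly also covers text columns, where an empty field would otherwise be read as an empty string rather than NA.
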
The physical properties of the dataset are outlined as follows:
	             

The data are organized into 2,938 rows and 22 columns, as noted earlier, and the total number of missing values is 2,563.
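
Figures of this kind can be obtained with a few base R calls. The sketch below uses a small invented data frame (led_demo is a hypothetical stand-in, not the real dataset), so the numbers it produces are not those quoted above.

```r
# Hypothetical stand-in for the imported dataset; values are invented.
led_demo <- data.frame(
  Country = c("A", "B", "C", "D"),
  Life.expectancy = c(65.0, NA, 59.9, 71.2),
  Alcohol = c(0.01, 4.50, NA, NA)
)

dims <- dim(led_demo)                          # number of rows, number of columns
total_missing <- sum(is.na(led_demo))          # total count of NA cells
missing_by_column <- colSums(is.na(led_demo))  # NA count for each column

print(dims)
print(total_missing)
print(missing_by_column)
```

The per-column counts are often more useful than the total, because they show which attributes would lose the most data under any cleaning rule.
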

Data Transformation

From the original data, we obtain the following summary:

Figure 3: Summary of Original Data set

The missing values outlined in Figure * and Figure * can be excluded from the dataset by filtering or by omitting the affected rows. As observed, the dataset is organized as a data frame, but R reads it into a friendlier type (a tibble). This makes inspection easier, and the missing values can be observed by running the view() command. Alternatively, the missing values can be cleaned by replacing them with zeros or with the mean of the column in which they occur. In the Excel view, the missing values were highlighted so that cleaning can start with those cells.
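
The two replacement strategies mentioned above, zeros and column means, can be sketched in base R as follows. The data frame and its values are invented for illustration; led_demo, led_zero, and led_mean are hypothetical names.

```r
# Hypothetical numeric columns with gaps; values are invented.
led_demo <- data.frame(
  Alcohol = c(1.0, NA, 3.0),
  BMI = c(20.0, 30.0, NA)
)

# Option 1: replace every NA with zero.
led_zero <- led_demo
led_zero[is.na(led_zero)] <- 0

# Option 2: replace each NA with the mean of its own column,
# computed from the values that are present.
led_mean <- led_demo
for (col in names(led_mean)) {
  if (is.numeric(led_mean[[col]])) {
    col_mean <- mean(led_mean[[col]], na.rm = TRUE)
    led_mean[[col]][is.na(led_mean[[col]])] <- col_mean
  }
}

print(led_zero)
print(led_mean)
```

Zero is a real value for most of these attributes, so replacing with zeros pulls the column averages down, whereas mean replacement keeps each column's average unchanged.
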

The omission can be invoked by the following command,

na.omit(led)

Figure 4: Summary of cleaned data ready for Analysis

The summary above shows that, on eliminating the rows with missing values, the total number of missing values drops to zero.
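
A small check of this behaviour is sketched below on invented data (led_demo and led_clean are hypothetical names). Note that na.omit() drops every row that contains at least one NA, so the row count shrinks as well.

```r
# Hypothetical stand-in for the raw dataset; values are invented.
led_demo <- data.frame(
  Country = c("A", "B", "C", "D"),
  Life.expectancy = c(65.0, NA, 59.9, 71.2),
  Alcohol = c(0.01, 4.50, NA, 2.30)
)

# Keep only the rows that are complete in every column.
led_clean <- na.omit(led_demo)

missing_after <- sum(is.na(led_clean))            # NA cells left after cleaning
rows_dropped <- nrow(led_demo) - nrow(led_clean)  # rows removed by na.omit()

print(missing_after)
print(rows_dropped)
```

Comparing the row counts before and after is a quick way to judge how much data the omission costs.
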

Having dealt with the missing values, we can go ahead and explore visually how the data is faring. This is made easier by keeping only the variables (columns) that we want to work on. The variables are selected using the following command in R.
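
One way such a column selection could look in base R is sketched here. The column names and values are assumptions made for illustration, and led_clean and led_subset are hypothetical names.

```r
# Hypothetical cleaned dataset; column names and values are invented.
led_clean <- data.frame(
  Country = c("A", "B"),
  Year = c(2014, 2015),
  Life.expectancy = c(65.0, 71.2),
  Alcohol = c(0.01, 2.30),
  BMI = c(19.1, 25.4)
)

# Names of the variables to keep for the exploration.
keep <- c("Life.expectancy", "Alcohol", "BMI")

# Subset the data frame by column name; all rows are retained.
led_subset <- led_clean[, keep]

print(led_subset)
```
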

Additional data that could be consolidated with this dataset, and that would be useful in relaying key insights into life expectancy, are as follows:

Fertility rate

Country’s Level of industrialization

Demographic patterns

Ages of citizens

Data Exploration

Since the data have been cleaned, summary statistics of the dataset can now be computed. This makes sense because we are excluding observations for which no value was recorded at the time the dataset was scraped from the internet. It also gives a key insight into the margins of error to expect when running statistics on the dataset.

My exploration focuses on the relationship between alcohol intake and BMI. The motivation is that alcohol consumption is believed to be a key factor in the increase of Body Mass Index. Before starting the analysis, the hypothesis is that higher alcohol intake leads to an increased BMI. To test this hypothesis, we can employ line graphs and look for the expected linear relationship. Alternatively, we can compute the Karl Pearson correlation coefficient in Excel.
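
The Pearson correlation mentioned above can also be computed directly in R. The sketch below is only an illustration: the alcohol and bmi vectors are invented values, not figures from the dataset, and in practice they would be the corresponding columns of the cleaned data.

```r
# Hypothetical paired observations; values are invented for illustration.
alcohol <- c(1, 2, 3, 4, 5)
bmi <- c(2, 4, 5, 4, 5)

# Pearson correlation coefficient between the two variables.
r <- cor(alcohol, bmi, method = "pearson")

# Significance test of the null hypothesis that the true correlation is zero.
test <- cor.test(alcohol, bmi, method = "pearson")

print(r)
print(test$p.value)
```

A coefficient near +1 would support the hypothesized positive linear relationship, while a value near 0 would not. Correlation alone does not show that alcohol intake causes the change in BMI.
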

The correlation between life expectancy and the above attributes, alcohol and BMI, would give us some indication of what to advise the general population.

Income composition is another interesting statistic. We can relate it to life expectancy and examine the relationships that reveal general trends. We start by examining how the means and medians vary for the numerical data.

In R, we employ the summary() command.

This computes the summary statistics and gives output in the following format:

As seen from the figure above, the quartiles, the means, and the number of missing values are reported. This is done on the raw dataset.
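
The difference between summarizing raw and cleaned data can be sketched on a small invented data frame (led_demo is a hypothetical stand-in). For a column that contains missing values, summary() adds an "NA's" entry with their count.

```r
# Hypothetical stand-in for the raw dataset; values are invented.
led_demo <- data.frame(
  Life.expectancy = c(65.0, NA, 59.9, 71.2),
  Alcohol = c(0.01, 4.50, NA, 2.30)
)

# Summary of the raw data: quartiles, mean, and an "NA's" row per column.
raw_summary <- summary(led_demo)
print(raw_summary)

# Summary after dropping incomplete rows: the "NA's" row disappears.
clean_summary <- summary(na.omit(led_demo))
print(clean_summary)
```
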

Performing the same on the cleaned data, named LED_WHO_DS, we can confirm that no missing values remain.

Figure 5: A clean data Sample, where Analysis can be done