Running Head: ANALYZING AND VISUALIZING DATA 2

Assignment 2

Data Selection: Physical properties of Data set

The 2019 Novel Coronavirus (2019-nCoV), a coronavirus later named SARS-CoV-2, is the virus identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. This data set contains daily information on the number of confirmed cases, deaths, and recoveries from the 2019 novel coronavirus. Note that this is time-series data, so the number of cases reported for any given day is the cumulative number. Day-level information on the people affected can give the wider data science community some interesting insights. Johns Hopkins University has developed an excellent dashboard using the case data, and the data are available as CSV files in the Johns Hopkins GitHub repository. The same data can be explored by the wider data science community in Kaggle Kernels.

The data analyzed in this paper come from the Novel Corona Virus 2019 Dataset, which provides day-level information on COVID-19 affected cases (date: September 11, 2020). The main file of this dataset is Covid 19 data.csv, and its columns are as follows: serial number (Sno); observation date; province or state of the observation; country or region of the observation; time in UTC at which the row was last updated; cumulative number of confirmed cases to date; and the cumulative tallies of deaths and recoveries to that date.
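Because the counts are cumulative, daily new cases have to be derived by differencing. The following is a minimal sketch in base R; the vector of counts is hypothetical and merely stands in for one state's confirmed-case column.

```r
# Hypothetical cumulative confirmed counts for one state on five consecutive days
cumulative <- c(5, 8, 8, 14, 20)

# Daily new cases: the first day's count, followed by day-over-day differences
daily_new <- c(cumulative[1], diff(cumulative))

print(daily_new)
```

Summing the daily values recovers the last cumulative count, which is a quick consistency check on the differencing.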

Cleaning or Modifying Existing Data: Count/Tally by Group and Bar Chart in R-Studio

The data of interest in this paper are the daily confirmed cases of the 2019 Novel Coronavirus in the United States, by state. The tally() function is a convenient summary wrapper that calls n() or sum(n), depending on whether the data are being tallied for the first time or re-tallied; count() is similar, but it groups the data before counting and ungroups afterwards (Dalgaard, 2008). The add_tally() function adds a column n to the table based on the number of items in each existing group, and add_count() is a shortcut that also does the grouping; both insert a column instead of collapsing each group into one row. The dplyr package also exports its own dplyr::tally() function, alongside the formula-based tally() provided by the mosaic package (Lee and Oliver, 2016). If x inherits from class "tbl" or "data.frame", dplyr's tally() is called (Lee and Oliver, 2016). This makes it easier for the two packages to coexist.
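The difference between these helpers can be sketched on a toy data frame. This is an illustration only, assuming the dplyr package is installed; the data and the column names state and confirmed are hypothetical and are not taken from the actual CSV file.

```r
library(dplyr)

# Hypothetical toy records standing in for rows of the real data set
cases <- data.frame(
  state     = c("Texas", "Texas", "Ohio", "Ohio", "Ohio", "Utah"),
  confirmed = c(10, 15, 3, 4, 8, 2)
)

# tally() counts the rows in each group that already exists
by_group <- cases %>% group_by(state) %>% tally()

# count() does the grouping, the tally, and the ungrouping in one call
counted <- cases %>% count(state)

# add_count() keeps every row and appends the group size as column n
with_n <- cases %>% add_count(state)

# With a weight, n is the sum of the weight column instead of the row count
weighted <- cases %>% count(state, wt = confirmed)

print(counted)
print(weighted)
```

Here count() and tally() collapse the six rows to one row per state, while add_count() returns all six rows with the extra column.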

The primary use case is a (possibly multidimensional) table described by a formula. Each component of the formula is converted into one dimension of the cross table of counts. For tables of proportions or percentages, the values are computed conditionally on the "secondary" (i.e., conditioning) variables, which are defined as everything other than the left-hand side of the formula if there is a left-hand side, and everything except the right-hand side if the left-hand side of the formula is empty. Note that groups are folded into the formula, and so become part of the conditioning, before this definition is applied. When marginal totals are added for each of the conditioning dimensions, the proportions sum to 1 within each level of the conditioning variables (Lee and Oliver, 2016).
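The idea of conditional proportions can be illustrated with base R alone. The sketch below uses xtabs() and prop.table() as a stand-in for the formula-based tally described above, not that function itself; the data frame and its columns region and status are hypothetical.

```r
# Hypothetical observations, one row per record
obs <- data.frame(
  region = c("East", "East", "East", "West", "West"),
  status = c("confirmed", "confirmed", "recovered", "confirmed", "recovered")
)

# Each term of the formula becomes one dimension of the count table
counts <- xtabs(~ status + region, data = obs)

# Proportions of status conditional on region (region is the column dimension)
props <- prop.table(counts, margin = 2)

# Add the marginal total over status; each region's column then ends in a sum of 1
with_totals <- addmargins(props, margin = 1)

print(with_totals)
```

Conditioning on region means each column is scaled separately, which is why the totals are 1 per region rather than 1 for the whole table.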

Figure 1: Extracted Data from CSV file using Count/Tally Observations by Group in R-Studio

Figure 2: The syntax used to create a bar-chart in R

The format = 'sparse' option is used so that a full data frame is first created and the rows with zero counts are then removed. The data are then represented on a bar chart. The script above (Figure 2) creates the bar chart and saves it in the current working directory of R. In this case, the bar chart represents the data as rectangular bars whose length is proportional to the value of the variable. R can draw both vertical and horizontal bars. Each bar represents the number of confirmed cases of the 2019 Novel Coronavirus for one state. The parameters used to develop the bar chart include the following: H (a vector or matrix that contains the numeric values used in the bar chart); xlab (the label of the x-axis); and ylab (the label of the y-axis) (Lee and Oliver, 2016). Note that a simple bar chart can be created using only the input vector and the name of each bar, as shown in Figure 3 below.

Figure 3: R Studio derived Bar chart of the cleaned and modified data derived from dataset
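A script of the kind described for Figure 2 might look like the following sketch. It is not the script from the figure: the counts in H, the output file name, the title, and the colour are all hypothetical, and png() is assumed as the graphics device (it requires an R build with PNG support).

```r
# Hypothetical per-state counts; the real values come from the tallied data
H <- c(Ohio = 15, Texas = 25, Utah = 2)

# Hypothetical file name, written to the current working directory
out_file <- "confirmed_by_state.png"

# Open the PNG device, draw the chart, then close the device to write the file
png(filename = out_file)
bar_mids <- barplot(
  H,
  names.arg = names(H),        # name shown under each bar
  xlab = "State",              # label of the x-axis
  ylab = "Confirmed cases",    # label of the y-axis
  main = "Confirmed cases by state (toy data)",
  col = "steelblue"
)
dev.off()
```

Passing horiz = TRUE to barplot() would draw the same bars horizontally.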

References

Dalgaard, P. (2008). Introductory statistics with R. Springer Science & Business Media.

Lee, M. S., & Oliver, P. M. (2016). Count cryptic species in biodiversity tally. Nature, 534(7609), 621-621.

Schumacker, R. E. (2014). Learning statistics using R. SAGE Publications.