PowerPoint Data Analysis



Data Analytics on COVID-19 Variants

Project Report

Bs

2/14/2022

Table of Contents

Introduction

Data

Methods

Analysis

Conclusion

Appendix


Introduction:

In this report, we analyze a dataset of COVID-19 variants. We first apply data pre-processing to the dataset, then use exploratory data analysis to view the data in graphical form, and finally apply a linear regression model to predict values in the dataset.

First, we import the Python libraries and load the dataset. We then apply data pre-processing, which converts raw data into meaningful data, so that we can check the quality of the data before applying machine learning. We remove irrelevant data and duplicate records, and then check whether the data type of each column is correct. Next, we standardize the data, which rescales the values so that they have a mean of 0 and a standard deviation of 1. We then check for outliers and remove any that are present. After removing the outliers, we handle the missing data and normalize the data. Finally, we encode the categorical data as numeric data using label encoding or one-hot encoding. In this way, we clean the raw data into an understandable format.
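The scaling and encoding steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in the project. The sample values and the helper names `standardize`, `min_max_normalize`, and `label_encode` are assumptions made for the example, and min-max scaling is assumed as the normalization method because the report does not name one.

```python
from statistics import mean, pstdev

# Hypothetical sample values; they are not taken from the COVID variants dataset.
values = [2.0, 4.0, 6.0, 8.0, 10.0]

def standardize(xs):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def min_max_normalize(xs):
    """Rescale values to the range [0, 1] (one common form of normalization)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def label_encode(labels):
    """Map each distinct label to an integer code, assigned in sorted order."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes

z_scores = standardize(values)
scaled = min_max_normalize(values)
encoded, mapping = label_encode(["Delta", "Alpha", "Omicron", "Delta"])

print(z_scores)
print(scaled)
print(encoded, mapping)
```

Standardization centers the values around 0, while min-max normalization squeezes them into a fixed range. Label encoding assigns one integer per category, which is compact but implies an ordering that the categories may not have.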

The next step is exploratory data analysis, in which we view the dataset through charts and graphs. We plot histograms, bar charts, a heat map, line graphs, and box plots, and build a frequency table. These graphs help us to view the different types of data in the dataset.

The final step is to apply a machine learning model to the dataset. We use a linear regression model to predict values in the dataset and compare the actual and predicted values. At the end, we compute the model accuracy and draw a scatter plot.

Data:

The data describes COVID-19 variants. It consists of 100416 rows and 6 columns. It records the location where each variant was found and the date on which it was found. It also records the number of sequences, the percentage of sequences, and the total number of sequences. There are 24 variant labels in the dataset, as follows:

1. Alpha

2. B.1.1.277

3. B.1.1.302

4. B.1.1.519

5. B.1.160

6. B.1.177

7. B.1.221

8. B.1.258

9. B.1.367

10. B.1.620

11. Beta

12. Delta

13. Epsilon

14. Eta

15. Gamma

16. Iota

17. Kappa

18. Lambda

19. Mu

20. Omicron

21. S:677H.Robin1

22. S:677P.Pelican

23. Non who

24. Others

The table below describes each variable in the dataset and gives its data type.

| Variable | Description | Data Type |
|---|---|---|
| Location | Location where the variant was found | String |
| Date | Date when it was found | Date time |
| Variants | COVID-19 variants | String |
| Number of sequences | Number of sequences in COVID-19 | Integer |
| Percentage of sequences | Percentage of sequences in COVID-19 | Float |
| Total number of sequences | Total number of sequences in COVID-19 | Integer |
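As a rough sketch of how these data types could be checked and enforced with pandas, the example below builds a tiny table and converts each column. The column names and the two rows are hypothetical and were chosen only for illustration; the real file's headers may differ.

```python
import pandas as pd

# Hypothetical rows and column names; everything is read in as text first.
raw = pd.DataFrame({
    "location": ["CountryA", "CountryB"],
    "date": ["2021-01-04", "2021-01-18"],
    "variant": ["Alpha", "Delta"],
    "num_sequences": ["0", "3"],
    "perc_sequences": ["0.0", "12.5"],
    "num_sequences_total": ["3", "24"],
})

# Convert each column to the data type listed in the table above.
typed = raw.astype({
    "location": "string",
    "variant": "string",
    "num_sequences": "int64",
    "perc_sequences": "float64",
    "num_sequences_total": "int64",
})
typed["date"] = pd.to_datetime(typed["date"])

print(typed.dtypes)
```

Checking `typed.dtypes` after loading is a quick way to confirm that numeric columns were not read in as text.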

Properties of Data types:

Here, we discuss some properties of the variables.

| Variable | Discrete / Continuous | Categorical / Ordinal / Quantitative |
|---|---|---|
| Location | Discrete | Categorical |
| Date | Discrete | Quantitative |
| Variants | Discrete | Categorical |
| Number of sequences | Discrete | Quantitative |
| Percentage of sequences | Continuous | Quantitative |
| Total number of sequences | Discrete | Quantitative |

Methods:

In this project, the method we used to train the model and make predictions is linear regression. Linear regression models the relationship between data points by fitting a straight line to them. This straight line helps us to predict future values. The equation of linear regression is as follows:

$$Y = aX + b$$

Here, X is the independent variable and Y is the dependent variable. The slope of the line is a and the intercept is b. We build the model on the dataset using linear regression and predict future values.

When there is one input variable, the method is known as simple linear regression. When there is more than one input variable, it is known as multiple linear regression. The equation of multiple linear regression is as follows:

$$Y = b + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$$

Here, $X_1, \dots, X_n$ are the input variables and $a_1, \dots, a_n$ are their coefficients.

We applied linear regression to our dataset to predict the total number of sequences. For this purpose, we split the data into training and test sets in a 7:3 ratio. Then, we build the model, fit it to the training data, and use it to predict values for the test data. At the end, we print the predicted values.
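The split-fit-predict procedure can be sketched as follows. This is an illustration only: it uses NumPy's least-squares line fit (`np.polyfit`) rather than whichever library the project used, and the data is synthetic, generated from an assumed line with slope 3 and intercept 5 plus noise.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data standing in for the input and target columns (an assumption).
x = rng.uniform(0, 100, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=200)

# Shuffle the row indices, then split them 70% train / 30% test.
indices = rng.permutation(len(x))
n_train = len(x) * 7 // 10
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Fit Y = aX + b on the training rows only.
a, b = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Predict the target for the held-out test rows.
y_pred = a * x[test_idx] + b

print(f"slope={a:.3f}, intercept={b:.3f}")
print(y_pred[:5])
```

Fitting on the training rows alone keeps the test rows unseen, so comparing `y_pred` with `y[test_idx]` gives an honest estimate of how the model behaves on new data.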

Many other machine learning algorithms could also be applied, such as KNN, Multiple Linear Regression, Logistic Regression, and Decision Tree. Different algorithms give different results on the same dataset, so the accuracy changes from model to model.

Analysis:

We now analyze the dataset using exploratory data analysis (EDA) and build a model on it. First, we need to pre-process the dataset. Data preprocessing cleans the raw data. The steps of data preprocessing are as follows:

· Import Libraries and Read Data

· Remove irrelevant Data

· Remove Duplicate Records

· Check Data types

· Standardize the data

· Investigate the Outliers

· Handle the Missing Data

· Normalize the data

· Encoding Categorical Data
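Several of the steps listed above can be sketched with pandas. The example below is a minimal illustration on hypothetical rows, not the project's actual code. It assumes the 1.5 × IQR rule for outliers and row removal for missing values, since the report does not name the exact techniques, and it leaves out the scaling steps.

```python
import pandas as pd

# Hypothetical rows; the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "C"],
    "variant": ["Alpha", "Alpha", "Delta", "Delta", "Alpha", "Omicron"],
    "num_sequences": [10.0, 10.0, 12.0, float("nan"), 11.0, 500.0],
})

# Remove duplicate records (the second row repeats the first).
df = df.drop_duplicates()

# Investigate outliers with the 1.5 * IQR rule. Comparisons with NaN are False,
# so rows with missing values are kept here and handled in the next step.
counts = df["num_sequences"]
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
is_outlier = (counts < q1 - 1.5 * iqr) | (counts > q3 + 1.5 * iqr)
df = df[~is_outlier]

# Handle missing data by dropping the rows that lack a sequence count.
df = df.dropna(subset=["num_sequences"]).copy()

# Encode the categorical variant column as integer codes (label encoding).
df["variant_code"] = df["variant"].astype("category").cat.codes

print(df)
```

Dropping rows is the simplest way to handle missing values. Filling them with a mean or median is an alternative when too many rows would otherwise be lost.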

Now, we perform exploratory data analysis on this dataset. The histogram of the COVID Variants Dataset is shown below:

Figure 1

The bar chart of the COVID Variants Dataset is shown below:

Figure 2

The heat map of the COVID Variants Dataset is shown below:

Figure 3

The line graph of the COVID Variants Dataset is shown below:

Figure 4

The box plot of the COVID Variants Dataset is shown below:

Figure 5
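The numbers behind charts like these can be computed with pandas before any plotting. The sketch below uses hypothetical rows and assumed column names. It builds a frequency table (the basis of a bar chart), per-variant totals, and a correlation matrix (the basis of a heat map).

```python
import pandas as pd

# Hypothetical rows; the column names and values are assumptions for illustration.
df = pd.DataFrame({
    "variant": ["Alpha", "Delta", "Delta", "Omicron", "Delta", "Alpha"],
    "num_sequences": [5, 20, 30, 8, 25, 7],
    "num_sequences_total": [50, 40, 60, 80, 50, 70],
})

# Frequency table: how many rows each variant has.
freq = df["variant"].value_counts()

# Total number of sequences per variant.
per_variant = df.groupby("variant")["num_sequences"].sum()

# Correlation matrix of the numeric columns.
corr = df[["num_sequences", "num_sequences_total"]].corr()

print(freq)
print(per_variant)
print(corr)
```

If matplotlib is installed, `freq.plot(kind="bar")` and `df["num_sequences"].plot(kind="hist")` would draw the corresponding bar chart and histogram.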

After exploratory data analysis, we build the model using linear regression. First, we split the data into training and test sets of 70% and 30% respectively. Then, we reshape the training and test data. Then, we build the linear regression model and fit it to the training data. The slope and intercept of our regression model are 0.11 and 71.64 respectively.
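To make the fitted line concrete, the sketch below plugs the reported slope and intercept into the equation of a straight line. The function name `predict` and the input values are hypothetical. The report does not state which column serves as the input, so `x` here is simply an input value.

```python
SLOPE = 0.11       # slope reported for the fitted model
INTERCEPT = 71.64  # intercept reported for the fitted model

def predict(x):
    """Return the value on the fitted line for input x."""
    return SLOPE * x + INTERCEPT

# Hypothetical input values chosen only for illustration.
inputs = [0, 100, 1000]
predictions = [predict(x) for x in inputs]

for x, y_hat in zip(inputs, predictions):
    print(f"x={x}: predicted {y_hat:.2f}")
```

With these coefficients, an input of 100 gives a prediction of about 82.64, and every prediction is at least 71.64 when the input is non-negative.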

Here, we use the model to make predictions. The predicted values are shown below:

Conclusion:

To draw a conclusion, we compare the actual and predicted values of the dataset. The actual and predicted values are shown below:

We can see that there is a large difference between the actual and predicted values. This is because the values in the dataset are ambiguous and the dataset contains many outliers. We therefore need to remove the outliers and then train and test the model again. The accuracy of the model is shown below:

The model accuracy is poor. Because the data is ambiguous, the error rate is high and we cannot achieve good accuracy. Now, we create the scatter plot. The scatter plot with trend line is shown below:

Figure 6
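The report does not say which measure was used for the model accuracy. For regression, two common choices are the coefficient of determination (R²) and the mean absolute error. The sketch below computes both in plain Python. The actual and predicted values are hypothetical and are not the project's results.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values chosen only for illustration.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

r2 = r_squared(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

An R² close to 1 means the line explains most of the variation in the target. A value near 0 or below means the model does little better than always predicting the mean.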

Hence, we conclude that most of the data contains null values. We need to remove the null values and the ambiguity in the data so that we can achieve high accuracy using linear regression.

Appendix:

APPENDIX A: Linear Regression

Linear regression models the relationship between X (the independent variable) and Y (the dependent variable) by fitting a straight line to the data. This straight line helps us to predict future values. The equation of linear regression is as follows:

$$Y = aX + b$$

The slope of the line is a and the intercept is b. We build the model on the dataset using linear regression and predict future values. When there is one input variable, the method is known as simple linear regression. When there is more than one input variable, it is known as multiple linear regression. The equation of multiple linear regression is as follows:

$$Y = b + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$$