Week-3 – System R Supplemental material

Recap

• R - workhorse data structures

• Data frame

• List

• Matrix / Array

• Vector

• System-R – Input and output

• read() function

• read.table and read.csv

• scan() function

• typeof() function

• setwd() function

• print()

• Factor variables

• Used in categorical analysis and statistical modelling

• Contain a predefined set of values called levels

• Descriptive statistics

• ls() – list of named objects

• str() – structure of the data and not the data itself

• summary() – provides a summary of data

• plot() – simple plot
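
The recap items on factors and on ls(), str() and summary() can be sketched as follows. This is a minimal illustration only: the vector `sizes` and the data frame `df` are hypothetical names and values that do not appear in the slides.

```r
# Hypothetical categorical data stored as a factor with predefined levels
sizes <- factor(c("small", "large", "medium", "small", "large"),
                levels = c("small", "medium", "large"))

levels(sizes)   # the predefined set of values
table(sizes)    # number of observations per level

# A small hypothetical data frame to inspect
df <- data.frame(id = 1:5, size = sizes)

ls()            # names of the objects in the current environment
str(df)         # structure of the data (types, dimensions), not the data itself
summary(df)     # per-column summary; factor columns are summarised as counts
```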

Descriptive statistics - continued

• Summary commands that return a single value. These commands work on variables containing numeric values.

• max() – Returns the maximum value in the vector

• min() – Returns the minimum value in the vector

• sum() – Returns the sum of all the vector elements

• mean() – Returns the arithmetic mean of the entire vector

• median() – Returns the median value of the vector

• sd() – Returns the standard deviation

• var() – Returns the variance

Descriptive statistics - single value results - example

temp is the name of the vector containing all numeric values
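
A minimal sketch of the single-value commands applied to `temp`. The values below are made up for illustration; the numbers used on the original slide are not recoverable from the text.

```r
# Hypothetical numeric vector; the slide's actual values are not known
temp <- c(12, 15, 9, 20, 14)

max(temp)     # largest element
min(temp)     # smallest element
sum(temp)     # sum of all elements
mean(temp)    # arithmetic mean
median(temp)  # middle value of the sorted vector
sd(temp)      # sample standard deviation
var(temp)     # sample variance (the square of sd)
```

Each call returns a single number, which is why these commands are grouped together.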

Descriptive statistics - multiple value results - example

• log(dataset) – Returns the natural logarithm of each element.

• summary(dataset) – Returns a summary of the values.

• quantile(dataset) – Returns the 0%, 25%, 50%, 75%, and 100% quantiles by default. Other quantiles can be selected with the probs argument.
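
A short sketch of the commands that return more than one value. The vector `temp` and its values are hypothetical, chosen only to keep the results easy to follow.

```r
# Hypothetical numeric vector
temp <- c(12, 15, 9, 20, 14)

log(temp)       # natural logarithm of every element (5 values)
summary(temp)   # minimum, quartiles, median, mean and maximum

q <- quantile(temp)                  # default: 0%, 25%, 50%, 75%, 100%
q
quantile(temp, probs = c(0.1, 0.9))  # other quantiles via the probs argument
```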

Descriptive Statistics in R for Data Frames

• max(frame) – Returns the largest value in the entire data frame.

• min(frame) – Returns the smallest value in the entire data frame.

• sum(frame) – Returns the sum of the entire data frame.

• fivenum(unlist(frame)) – Returns the Tukey five-number summary for all values in the data frame (fivenum() expects a numeric vector).

• length(frame) – Returns the number of columns in the data frame.

• summary(frame) – Returns the summary for each column.

Descriptive Statistics in R for Data Frames - Example
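
The original example slide is an image, so the sketch below uses a hypothetical all-numeric data frame named `frame`. Function names in R are case-sensitive and are written in lower case here. Passing `unlist(frame)` to fivenum() is a choice made for this sketch, because fivenum() is documented for numeric vectors.

```r
# Hypothetical all-numeric data frame
frame <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

max(frame)               # largest value anywhere in the data frame
min(frame)               # smallest value anywhere in the data frame
sum(frame)               # sum over every cell
length(frame)            # number of columns
summary(frame)           # one summary per column
fivenum(unlist(frame))   # Tukey five-number summary over all values
```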

Descriptive Statistics in R for Data Frames – rowMeans example
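
A minimal sketch of rowMeans(), using a hypothetical data frame because the slide's own example is not available in the text.

```r
# Hypothetical all-numeric data frame
frame <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Mean of each row, computed across the columns a and b
row_avg <- rowMeans(frame)
row_avg
```

The result has one value per row, so it can be stored as a new column if needed.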

Descriptive Statistics in R for Data Frames – colMeans example
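
A matching sketch for colMeans(), again on a hypothetical data frame.

```r
# Hypothetical all-numeric data frame
frame <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Mean of each column; the result is named after the columns
col_avg <- colMeans(frame)
col_avg
```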

Graphical analysis - simple linear regression model in R

• Linear regression is used to determine whether the dependent variable can be modelled as a linear function of the independent variable.

• Linear regression fits a straight regression line to the data.

• Prerequisite for implementing linear regression:

• The dependent variable should be approximately normally distributed (strictly, the model's residuals should be).

• The cars dataset, which ships with R, is used as an example to explain the linear regression model.

Creating a simple linear model

• cars is a dataset that ships with R (in the datasets package).

• The cars dataset contains two columns:

• X = speed (cars$speed)

• Y = dist (cars$dist)

• The data() function lists the datasets available in the loaded packages.

• The head() function prints the first few rows of a data frame (or the first few elements of a list).
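
The commands described above can be tried directly, since cars is part of a standard R installation. This is a small sketch rather than a reproduction of the slide's screenshot.

```r
# Load the built-in cars dataset into the workspace
data(cars)

head(cars)         # first six rows by default
head(cars, n = 3)  # first three rows
str(cars)          # two numeric columns: speed and dist

x <- cars$speed    # explanatory variable
y <- cars$dist     # response variable
```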

Creating a simple linear model

• A scatter plot with a fitted smooth curve is drawn using the scatter.smooth() function.
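
A possible sketch of this step: plot the cars data with scatter.smooth() and then fit the simple linear model with lm(). The slide text does not show the model call itself, so the use of lm() with the formula `dist ~ speed` and the object name `model` are assumptions made for this illustration.

```r
# Scatter plot of stopping distance against speed with a smooth curve
scatter.smooth(x = cars$speed, y = cars$dist,
               main = "dist ~ speed", xlab = "speed", ylab = "dist")

# Assumed model call: stopping distance as a linear function of speed
model <- lm(dist ~ speed, data = cars)

coef(model)     # intercept and slope of the fitted line
abline(model)   # add the fitted straight line to the existing plot
```

If the smooth curve and the straight line lie close together, a linear model is a reasonable description of the relationship.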

Graphical analysis - Plots

• Conforming to a statistical distribution such as the Normal or Poisson is a prerequisite for many analyses.

• Plotting graphs helps to ascertain the data distribution and its conformance to statistical models.

• Anomalies can be easily spotted during the analysis phase.

• Identifying noise and cleaning up dirty data is achieved with the help of preliminary plots.

Graphical analysis – plots with single variable in R

• Histograms – Used to display the mode, spread, and symmetry of a set of data. Mostly, the central tendency and the spread are analyzed using this function.

• R function: hist(y)

• Index Plots – Here, plot() takes a single argument and plots the values in the order in which they appear. This kind of plot is especially useful for error checking, because mistyped or outlying values stand out.

• R function: plot(y)

• Time Series Plots – When a time series is complete, the time series plot can be used to join the dots in an ordered set of y values.

• R function: plot.ts(y)

• Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.

• R function: pie(y)

• Some examples are provided in the next few slides.
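
Index plots and time series plots have no example slide of their own, so here is a brief sketch. The vector `y` is hypothetical and contains one deliberately mistyped value to show how an index plot exposes it.

```r
# Hypothetical measurements; the fourth value is a deliberate typing error
y <- c(5.1, 4.9, 5.3, 50.2, 5.0, 5.2)

plot(y)      # index plot: values against their position in the vector
plot.ts(y)   # time series plot: the same values joined in order

suspect <- which.max(y)   # position of the value that stands out
suspect
```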

Graphical analysis - Pie-Chart in R

• A pie chart is used to illustrate the proportional makeup of a sample in presentations

• Each segment of the pie needs a label to identify it
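
A minimal pie chart sketch with a label on every segment. The fruit counts are invented for illustration and the names `counts`, `pct` and `seg_labels` are not from the slides.

```r
# Hypothetical sample: number of items in each category
counts <- c(Apples = 40, Bananas = 25, Cherries = 20, Dates = 15)

# Percentage share of each category, used in the segment labels
pct <- round(100 * counts / sum(counts))
seg_labels <- paste0(names(counts), " (", pct, "%)")

pie(counts, labels = seg_labels, main = "Proportional makeup of the sample")
```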

Graphical analysis – Histograms in R

• A histogram is typically used to plot a continuous variable (example: time or time intervals on the x-axis)

• Data is grouped into bins

• Bin size can be changed

• Parts of a histogram

• Title – The title describes the information shown in the histogram.

• X-axis – The X-axis or horizontal axis shows the intervals (bins) into which the measurements fall.

• Y-axis – The Y-axis or vertical axis shows the number of values that fall within each interval on the X-axis.

• Bars – The height of each bar represents the number of values in its interval. In a histogram with equal bins, all bars have the same width.

• The width of a bin determines the range of values counted in that bin.

• Every bin has a range associated with it.

Histogram – R example
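
The slide's own example is an image, so the following is a stand-in sketch. The vector `waits` (waiting times in minutes) is hypothetical, and the `breaks` argument is used to show how the bin size can be changed.

```r
# Hypothetical waiting times in minutes
waits <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 12)

# Bins of width 3: (0,3], (3,6], (6,9], (9,12]
h <- hist(waits, breaks = seq(0, 12, by = 3),
          main = "Waiting times", xlab = "Minutes", ylab = "Frequency")
h$counts   # number of values in each bin (the bar heights)

# Same data with wider bins of width 6
h_wide <- hist(waits, breaks = seq(0, 12, by = 6),
               main = "Waiting times (wider bins)", xlab = "Minutes")
h_wide$counts
```

hist() returns the bin boundaries and counts as well as drawing the plot, which makes it easy to check what each bar represents.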

Graphical analysis – Plots with two variables

• Two types of variables used in R graphical data analysis are

• Response variable: represented on the Y-axis.

• Example: outside temperature on the y-axis.

• Explanatory variable: represented on the X-axis.

• Example: days on the x-axis.

• Frequently used plotting functions in R

• plot(x, y): Plot of y against x.

• barplot(y): Plot of y (vector of multiple values), with one bar per vector value.

Types of two-variable plots in R

• Scatterplots – Used when the explanatory variable is a continuous variable.

• Example: Distribution of rain (measured in cm) in any city over time.

• Stepped Lines – Used to plot data distinctly and provide a clear view.

• Example: Values recorded at equal intervals, such as the employee count every month.

• Barplots – Show the heights of the mean values from the different treatments.

• Example: Temperatures for any given week

Scatter Plot – Weight vs Mileage

• plot(x, y, main, xlab, ylab, xlim, ylim, axes)

• where x is the data plotted on the horizontal axis.

• y is the data plotted on the vertical axis.

• main is the title of the plot.

• xlab is the label for the horizontal axis.

• ylab is the label for the vertical axis.

• xlim gives the limits of x for plotting, as c(lower, upper).

• ylim gives the limits of y for plotting, as c(lower, upper).

• axes indicates whether the axes are drawn (TRUE or FALSE).
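
One way to draw a weight versus mileage scatter plot with these arguments is shown below. The slide does not name its data source, so using the built-in mtcars dataset (columns wt and mpg) is an assumption, as are the axis limits chosen here.

```r
# Assumed data source: built-in mtcars (wt = weight in 1000 lbs, mpg = miles per gallon)
weight  <- mtcars$wt
mileage <- mtcars$mpg

plot(x = weight, y = mileage,
     main = "Weight vs Mileage",
     xlab = "Weight (1000 lbs)",
     ylab = "Mileage (miles per gallon)",
     xlim = c(1, 6),     # chosen to cover every weight in mtcars
     ylim = c(10, 35),   # chosen to cover every mileage in mtcars
     axes = TRUE)
```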

Barplot – R example
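
A stand-in sketch for the barplot slide, using the "temperatures for a week" idea mentioned earlier. The values in `temps` are invented for illustration.

```r
# Hypothetical daily temperatures for one week (degrees Celsius)
temps <- c(Mon = 21, Tue = 23, Wed = 19, Thu = 25, Fri = 22, Sat = 24, Sun = 20)

# One bar per vector value; the vector names become the bar labels
mids <- barplot(temps,
                main = "Temperature by day",
                xlab = "Day", ylab = "Temperature (C)")
mids   # x positions of the bar centres, useful for adding text later
```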

Saving R graphs to files – some commonly used types

• The following devices (file types) are supported in R

• PDF: Adobe Acrobat document, opened with pdf()

• PNG: PNG bitmap file, opened with png()

• JPEG: JPEG bitmap file, opened with jpeg()

• BMP: general bitmap file, opened with bmp()

• TIFF: TIFF bitmap file, opened with tiff()
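
Saving a graph follows the same pattern for every device: open the device, draw the plot, then close the device. The sketch below uses pdf(); the file name and the use of tempdir() are assumptions for illustration, and a real script would typically write to a project folder instead.

```r
# Hypothetical output location inside the session's temporary directory
out_file <- file.path(tempdir(), "cars_scatter.pdf")

pdf(file = out_file, width = 7, height = 5)   # open the PDF device (size in inches)
plot(cars$speed, cars$dist,
     main = "Speed vs stopping distance", xlab = "speed", ylab = "dist")
dev.off()                                     # close the device and write the file

file.exists(out_file)
```

The bitmap devices png(), jpeg(), bmp() and tiff() are used the same way, with the file name as the first argument and sizes given in pixels by default.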

R Packages

• R has more than 10,000 packages in the CRAN repository.

• Eight of the most common and useful packages are listed here

• dplyr – Data analysis and wrangling

• This package predominantly uses data frames to perform data analysis

• ggplot2 – Visualization package

• This package facilitates the creation of graphics. Seven major types of visualizations are made possible using this package.

• tidyr – Data clean-up

• This package helps to convert raw data into a tidy column and row format.

• shiny – Interactive web applications in R

• This package builds web applications and helps in embedding visualizations (graphs, plots and charts) into them.

• caret – Classification and regression training

• Regression and classification problems can be modeled using this package.

• e1071 – Clustering

• Naïve Bayes and clustering algorithms are generally implemented in R using this package

• plotly – Interface to a JavaScript library for building graphs

• Provides an R interface to the plotly.js JavaScript library, and makes the development of graphical web applications easier.

• tidyquant – Quantitative financial analysis package

• Widely used by fund managers, this package helps in analyzing financial data.
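
Packages are installed once from CRAN and then loaded in each session. The sketch below only checks which of the eight packages are already installed, so it runs without a network connection; the install and load steps are left as comments. The variable names `pkgs` and `installed` are illustrative.

```r
# The eight packages named above (package names are case-sensitive)
pkgs <- c("dplyr", "ggplot2", "tidyr", "shiny",
          "caret", "e1071", "plotly", "tidyquant")

# TRUE for each package that can be loaded on this machine
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install whatever is missing (needs internet access, run once):
# install.packages(pkgs[!installed])

# Attach a package for use in the current session:
# library(dplyr)
```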
