Computer Science
Week-3 – System R Supplemental material
Recap
• R - workhorse data structures
• Data frame
• List
• Matrix / Array
• Vector
• System-R – Input and output
• read() function
• read.table and read.csv
• scan() function
• typeof() function
• setwd() function
• print()
• Factor variables
• Used in category analysis and statistical modelling
• Contain a predefined set of values called levels
• Descriptive statistics
• ls() – list of named objects
• str() – structure of the data and not the data itself
• summary() – provides a summary of data
• plot() – Simple plot
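The recap items on factors and on ls(), str() and summary() can be tried out with a minimal sketch like the one below. The survey data and the names size, score and survey are hypothetical and are not taken from the slides.

```r
# Hypothetical data (an assumption for illustration only).
size  <- factor(c("small", "large", "medium", "small", "large"),
                levels = c("small", "medium", "large"))
score <- c(3.1, 4.8, 4.0, 2.9, 5.2)

print(levels(size))     # the predefined set of values of the factor
print(table(size))      # number of observations per level

survey <- data.frame(size = size, score = score)
str(survey)             # structure of the data, not the data itself
print(summary(survey))  # per-column summary
print(ls())             # names of the objects created so far
```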
Descriptive statistics - continued
• Summary of commands that return a single value. These commands work on variables containing numeric values.
• max() – Shows the maximum value in the vector
• min() – Shows the minimum value in the vector
• sum() – Shows the sum of all the vector elements
• mean() – Shows the arithmetic mean of the entire vector
• median() – Shows the median value of the vector
• sd() – Shows the standard deviation
• var() – Shows the variance
Descriptive statistics - single value results - example
temp is the name of the vector containing all numeric values
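A minimal sketch of the single-value commands, assuming temp holds a week of hypothetical temperature readings (the values are made up for illustration):

```r
# Hypothetical numeric vector; the values are an assumption.
temp <- c(20, 22, 19, 25, 24, 21, 23)

print(max(temp))     # largest element
print(min(temp))     # smallest element
print(sum(temp))     # sum of all elements
print(mean(temp))    # arithmetic mean
print(median(temp))  # middle value of the sorted vector
print(sd(temp))      # sample standard deviation
print(var(temp))     # sample variance (sd squared)
```

Each call returns exactly one number, which is why these commands are grouped together.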
Descriptive statistics - multiple value results - example
• log(dataset) – Shows the natural log of each element.
• summary(dataset) – Shows the summary of the values.
• quantile() – Shows the quantiles; by default the 0%, 25%, 50%, 75%, and 100% quantiles. Other quantiles can be selected as well.
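The multiple-value commands can be sketched as follows, again assuming a hypothetical numeric vector named temp. The probs argument of quantile() is how other quantiles are selected.

```r
# Hypothetical numeric vector; the values are an assumption.
temp <- c(20, 22, 19, 25, 24, 21, 23)

logs <- log(temp)          # natural log of every element
print(logs)

print(summary(temp))       # min, quartiles, mean and max in one call

q_default <- quantile(temp)                       # 0%, 25%, 50%, 75%, 100%
q_custom  <- quantile(temp, probs = c(0.1, 0.9))  # selected quantiles
print(q_default)
print(q_custom)
```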
Descriptive Statistics in R for Data Frames
• max(frame) – Returns the largest value in the entire data frame.
• min(frame) – Returns the smallest value in the entire data frame.
• sum(frame) – Returns the sum of the entire data frame.
• fivenum(frame) – Returns the Tukey summary values for the entire data frame.
• length(frame) – Returns the number of columns in the data frame.
• summary(frame) – Returns the summary for each column.
Descriptive Statistics in R for Data Frames - Example
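A minimal sketch of these commands on a small, all-numeric data frame. The frame and its columns x, y and z are hypothetical. For the Tukey summary the frame is flattened with unlist() first; that is a choice made for this sketch so that fivenum() receives a plain numeric vector.

```r
# Hypothetical all-numeric data frame (an assumption for illustration).
frame <- data.frame(x = c(1, 2, 3),
                    y = c(4, 5, 6),
                    z = c(7, 8, 9))

print(max(frame))      # largest value anywhere in the frame
print(min(frame))      # smallest value anywhere in the frame
print(sum(frame))      # sum over every cell
print(length(frame))   # number of columns, not rows
print(summary(frame))  # one summary per column

# Tukey five-number summary over all values, flattened to a vector first.
tukey <- fivenum(unlist(frame))
print(tukey)
```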
Descriptive Statistics in R for Data Frames – rowMeans example
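A minimal sketch of rowMeans(), assuming a hypothetical all-numeric data frame named frame:

```r
# Hypothetical data frame (an assumption for illustration).
frame <- data.frame(x = c(1, 2, 3),
                    y = c(4, 5, 6),
                    z = c(7, 8, 9))

# One mean per row, computed across the columns x, y and z.
row_avg <- rowMeans(frame)
print(row_avg)
```

The result has one element per row, so it can be stored as a new column if needed.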
Descriptive Statistics in R for Data Frames – colMeans example
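A minimal sketch of colMeans(), assuming a hypothetical all-numeric data frame named frame:

```r
# Hypothetical data frame (an assumption for illustration).
frame <- data.frame(x = c(1, 2, 3),
                    y = c(4, 5, 6),
                    z = c(7, 8, 9))

# One mean per column; the result is named after the columns.
col_avg <- colMeans(frame)
print(col_avg)
```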
Graphical analysis - simple linear regression model in R
• Linear regression is used to check whether the dependent variable can be modelled as a linear function of the independent variable.
• Linear regression fits a straight regression line through the data.
• Prerequisite for implementing linear regression:
• The dependent variable (strictly, the model residuals) should be approximately normally distributed
• The cars dataset that ships with R is used as the example to explain the linear regression model.
Creating a simple linear model
• cars is a dataset that ships with R (datasets package), so it is available without reading any file.
• The cars dataset contains two columns
• X = speed (cars$speed)
• Y = dist (cars$dist)
• The data() function lists the datasets available in the loaded packages.
• The head() function prints the first few rows of a data frame.
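A minimal sketch of inspecting the cars dataset. Restricting data() to the datasets package and reading the Item column of its result are choices made for this sketch so the listing can be handled as ordinary values.

```r
# List the datasets shipped in the 'datasets' package and confirm cars is there.
available <- data(package = "datasets")
items <- available$results[, "Item"]
print("cars" %in% items)

print(head(cars))      # first six rows by default
print(head(cars, 3))   # first three rows only
str(cars)              # two numeric columns: speed and dist
```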
Creating a simple linear model
• The scatter plot is drawn using the scatter.smooth() function.
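A minimal sketch of the scatter plot and the simple linear model on cars. The use of lm() with the formula dist ~ speed and the dashed abline() overlay are assumptions about how the model would typically be fitted; the slides only name scatter.smooth().

```r
# Scatter plot of stopping distance against speed with a smoothed curve.
scatter.smooth(x = cars$speed, y = cars$dist,
               main = "dist ~ speed",
               xlab = "Speed", ylab = "Stopping distance")

# Fit dist as a linear function of speed (assumed model formula).
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))      # intercept and slope of the fitted line

# Overlay the fitted straight line on the existing plot.
abline(fit, lty = 2)
```

Comparing the dashed straight line with the smoothed curve gives a quick visual check of whether a linear relationship is plausible.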
Graphical analysis - Plots
• Conforming to a statistical distribution such as Normal or Poisson is a prerequisite for performing many analyses.
• Plotting graphs helps to ascertain the data distribution and its conformance to statistical models.
• Anomalies can be easily spotted during the analysis phase.
• Identifying noise and cleaning up dirty data is achieved with the help of preliminary plots.
Graphical analysis – plots with single variable in R
• Histograms – Used to display the mode, spread, and symmetry of a set of data. Mostly, the central tendency and the spread are analyzed using this function.
• R function: hist(y)
• Index Plots – Here, the plot takes a single argument and draws the values in the order they are stored. This kind of plot is especially useful for error checking (type-1 and type-2).
• R function: plot(y)
• Time Series Plots – When a period of time is complete, the time series plot can be used to join the dots in an ordered set of y values.
• R function: plot.ts(y)
• Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.
• R function: pie(y)
• Some examples are provided in the next few slides…
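Index plots and time series plots can be sketched as follows. The vector y is hypothetical, and the value 90 is planted deliberately to show how an index plot makes a suspicious entry stand out.

```r
# Hypothetical measurements; 90 is a deliberately implausible entry.
y <- c(12, 15, 14, 90, 13, 16, 15)

# Index plot: a single argument, values plotted against their position.
plot(y, main = "Index plot", xlab = "Index", ylab = "y")

# Position of the entry that stands out in the index plot.
suspect <- which.max(y)
print(suspect)

# Time series plot: the ordered values are joined by lines.
plot.ts(y, main = "Time series plot")
```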
Graphical analysis - Pie-Chart in R
• A pie chart is used to illustrate the proportional makeup of a sample in presentations
• Each segment of the pie needs a label to identify it
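A minimal sketch of a labelled pie chart. The commute data and the names share, modes and seg_labels are hypothetical.

```r
# Hypothetical sample: how 100 people commute (an assumption for illustration).
share <- c(40, 25, 20, 15)
modes <- c("Bus", "Car", "Bike", "Walk")

# Build one label per segment, including its percentage of the whole.
pct <- round(100 * share / sum(share))
seg_labels <- paste0(modes, " (", pct, "%)")

pie(share, labels = seg_labels, main = "Commute mode")
```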
Graphical analysis – Histograms in R
• A histogram is typically used to plot a continuous variable (example: time or time intervals on the x-axis)
• Data is presented in the form of bins
• Bin size can be changed
• Parts of a histogram
• Title – The title describes the information shown in the histogram.
• X-axis – The X-axis or horizontal axis shows the intervals into which the measurements fall.
• Y-axis – The Y-axis or vertical axis shows the number of times a value occurs within each interval on the X-axis.
• Bars – The height of each bar represents the number of times a value occurs in the interval. Histograms with equal bins have a uniform width across all the bars.
• The width of a bin determines the range of values that fall into that bin.
• Every bin has a range associated with it.
Histogram – R example
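A minimal sketch of a histogram, assuming the built-in cars dataset as the data source. The breaks argument is only a suggestion to hist(), so the number of bins drawn may differ slightly from the value given.

```r
# Histogram of stopping distances with the default bin choice.
h_default <- hist(cars$dist,
                  main = "Stopping distance",
                  xlab = "Distance")

# Same data with a larger suggested number of bins (smaller bin size).
h_fine <- hist(cars$dist, breaks = 12,
               main = "Stopping distance, finer bins",
               xlab = "Distance")

print(h_default$breaks)  # bin boundaries: the range of each bin
print(h_default$counts)  # bar heights: observations per bin
```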
Graphical analysis – Plots with two variables
• Two types of variables used in R graphical data analysis are
• Response variable:
• The response variable is represented on the Y-axis.
• Example: outside temperature on the y-axis
• Explanatory variable:
• The explanatory variable is represented on the X-axis.
• Example: days on the x-axis
• Frequently used plotting functions in R
• plot(x,y): Plot of y against x.
• barplot(y): Plot of y (vector of multiple values) – one bar per vector value.
Types of two-variable plots in R
• Scatterplots – Used when the explanatory variable is a continuous variable.
• Example: Distribution of rain (measured in cm) in any city over time.
• Stepped Lines – Used to plot data distinctly and provide a clear view.
• Example: Equidistant plots; an event that repeats every month (employee count)
• Barplots – Show the heights of the mean values from the different treatments.
• Example: Temperatures for any given week
Scatter Plot – Weight vs Mileage
• plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• where x is the data plotted on the horizontal axis.
• y is the data plotted on the vertical axis.
• main is the title of the plot.
• xlab is the label of the horizontal axis.
• ylab is the label of the vertical axis.
• xlim gives the limits of x for plotting.
• ylim gives the limits of y for plotting.
• axes indicates whether both axes should be drawn.
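A minimal sketch of a weight versus mileage scatter plot. It assumes the built-in mtcars dataset as the data source (columns wt and mpg); the slides do not name the dataset, and the axis limits are chosen here to cover its values.

```r
# Assumed data source: built-in mtcars (wt = weight, mpg = mileage).
x_limits <- c(1, 6)
y_limits <- c(10, 35)

plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Weight vs Mileage",
     xlab = "Weight",
     ylab = "Mileage",
     xlim = x_limits,   # range shown on the horizontal axis
     ylim = y_limits,   # range shown on the vertical axis
     axes = TRUE)       # draw both axes
```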
Barplot – R example
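A minimal sketch of a bar plot with one bar per vector value. The weekly temperatures are hypothetical.

```r
# Hypothetical temperatures for one week (an assumption for illustration).
temps <- c(Mon = 21, Tue = 23, Wed = 19, Thu = 25, Fri = 24, Sat = 22, Sun = 20)

# barplot() draws one bar per element and uses the names as bar labels.
# It returns the x positions of the bar midpoints.
mids <- barplot(temps,
                main = "Temperature by day",
                xlab = "Day", ylab = "Temperature")
print(mids)
```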
Saving R graphs to files – some commonly used types
• The following devices (file types) are supported in R
• PDF: Adobe Acrobat document
• PNG: PNG bitmap file
• JPEG: JPEG bitmap file
• BMP: general bitmap file
• TIFF: TIFF bitmap file
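A minimal sketch of saving a plot to a file using the PDF device. The file name and the choice of a temporary directory are assumptions. The other devices follow the same open, plot, close pattern with png(), jpeg(), bmp() or tiff() in place of pdf().

```r
# Hypothetical output location inside the session's temporary directory.
out_file <- file.path(tempdir(), "cars_scatter.pdf")

pdf(out_file, width = 6, height = 4)   # open the file device
plot(cars$speed, cars$dist,
     main = "dist ~ speed", xlab = "Speed", ylab = "Stopping distance")
invisible(dev.off())                   # close the device to write the file

print(file.exists(out_file))
```

Nothing is written to disk until dev.off() closes the device, so forgetting that call is a common reason for an empty or missing file.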
R Packages
• R has more than 10,000 packages in the CRAN repository.
• Eight of the most common and useful packages are listed here
• dplyr – Data analysis and wrangling
• This package works predominantly on data frames to perform data analysis.
• ggplot2 – Visualization package
• This package facilitates the creation of graphics. 7 major types of visualizations are made possible using this package.
• tidyr – Data clean-up
• This package helps to convert raw data into a tidy column and row format.
• shiny – Interactive web applications in R
• A web application framework that helps in embedding visualizations (graphs, plots and charts) into web applications.
• caret – Classification and regression training
• Regression and classification problems can be modeled using this package.
• e1071 – Clustering and classification
• Naïve Bayes and clustering algorithms are generally implemented in R using this package.
• plotly – R interface to the plotly.js JavaScript graphing library
• Provides an interface to the JavaScript library and makes the development of graphical web applications easier.
• tidyquant – Quantitative financial analysis package
• Widely used by fund managers, this package helps in analyzing financial data.