Analyzing and Visualizing Data - Overview with 12 slides
School of Computer & Information Sciences
ITS530 Analyzing and Visualizing Data
Introduction: R Advanced Graphs with ggplot2
ITS530 R Advanced Graphs
What is ggplot2?
Grammar of graphics:
Independently specify plot building blocks and combine them to create a graphical display to your liking.
Building blocks of a graph include:
data
aesthetic mapping
geometric object
statistical transformations
scales
coordinate system
position adjustments
faceting
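As a rough illustration of how these building blocks combine, the sketch below tags each line with the block it supplies. It uses the built-in mtcars data set as a stand-in (an assumption, not the course data), and the particular geoms, palette and axis limits are arbitrary choices.

```r
library(ggplot2)

# mtcars ships with R and is used here only as a stand-in data set (assumption).
p <- ggplot(data = mtcars,                              # data
            aes(x = wt, y = mpg)) +                     # aesthetic mapping
  geom_point(aes(colour = factor(cyl)),                 # geometric object
             position = "jitter") +                     # position adjustment
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                             # statistical transformation
  scale_colour_brewer(palette = "Dark2") +              # scale
  coord_cartesian(xlim = c(1, 6)) +                     # coordinate system
  facet_wrap(~ am)                                      # faceting

p  # printing the object draws the plot
```

Each `+` adds one independent piece, so any line can be swapped or removed without rewriting the rest.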
What is a geom?
Use a geom function to represent data points, and use the geom's aesthetic properties to represent variables.
Each geom function returns a layer.
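A small sketch of the difference between mapping a variable to an aesthetic and setting an aesthetic to a fixed value. The mtcars data set and the colour name are illustrative assumptions.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Inside aes(): colour represents a variable, so each cyl level gets its own colour.
mapped <- base + geom_point(aes(colour = factor(cyl)))

# Outside aes(): colour is a fixed property of the layer, the same for every point.
fixed <- base + geom_point(colour = "steelblue")

# layer_data() returns the data a layer draws, including the resolved colours.
mapped_colours <- unique(layer_data(mapped)$colour)
fixed_colours  <- unique(layer_data(fixed)$colour)

length(mapped_colours)  # one colour per cyl level
fixed_colours           # the single colour that was set
```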
[Figure slides, images not recovered. Surviving panel titles: 2 Variable, 3 Variable, 2 Variable contd.]
Different Geoms
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
| Function | Description |
| --- | --- |
| ggplot2::geom_abline | Reference lines: horizontal, vertical, and diagonal |
| ggplot2::geom_bar | Bar charts |
| ggplot2::geom_bin2d | Heatmap of 2d bin counts |
| ggplot2::geom_blank | Draw nothing |
| ggplot2::geom_boxplot | A box and whiskers plot (in the style of Tukey) |
| ggplot2::geom_contour | 2d contours of a 3d surface |
| ggplot2::geom_count | Count overlapping points |
| ggplot2::geom_density | Smoothed density estimates |
| ggplot2::geom_density_2d | Contours of a 2d density estimate |
| ggplot2::geom_dotplot | Dot plot |
| ggplot2::geom_errorbarh | Horizontal error bars |
| ggplot2::geom_hex | Hexagonal heatmap of 2d bin counts |
| ggplot2::geom_freqpoly | Histograms and frequency polygons |
| ggplot2::geom_jitter | Jittered points |
| ggplot2::geom_crossbar | Vertical intervals: lines, crossbars & errorbars |
| ggplot2::geom_map | Polygons from a reference map |
| ggplot2::geom_path | Connect observations |
| ggplot2::geom_point | Points |
| ggplot2::geom_polygon | Polygons |
| ggplot2::geom_qq | A quantile-quantile plot |
| ggplot2::geom_quantile | Quantile regression |
| ggplot2::geom_ribbon | Ribbons and area plots |
| ggplot2::geom_rug | Rug plots in the margins |
| ggplot2::geom_segment | Line segments and curves |
| ggplot2::geom_smooth | Smoothed conditional means |
| ggplot2::geom_spoke | Line segments parameterised by location, direction and distance |
| ggplot2::geom_label | Text |
| ggplot2::geom_raster | Rectangles |
| ggplot2::geom_violin | Violin plot |
| ggplot2::update_geom_defaults | Modify geom/stat aesthetic defaults for future plots |
Importing CSV data frames
setwd("/Users/…./UC/ITS530/Rprog")
dpc <- read.csv("dataset_price_personal_computers.csv", na.strings="")
Commands:
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
na.strings=""
A character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
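To see what na.strings="" changes, the sketch below writes a tiny CSV to a temporary file and reads it twice. The file contents and column names are made up for illustration and are not the course data set.

```r
# Hypothetical three-row CSV with one blank text field and one blank numeric field.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,brand,price",
             "1,acme,1499",
             "2,,1795",
             "3,zenith,"), tmp)

# Default (na.strings = "NA"): the blank text field stays an empty string,
# while the blank numeric field is already read as NA.
default_df <- read.csv(tmp)

# na.strings = "": blank text fields are read as NA as well.
blank_na_df <- read.csv(tmp, na.strings = "")

is.na(default_df$brand)    # no NA in the text column
is.na(blank_na_df$brand)   # the blank brand is now NA
is.na(blank_na_df$price)   # the blank price is NA in both reads

unlink(tmp)
```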
Factor command in R
dpc <- read.csv("dataset_price_personal_computers.csv", na.strings="")
library(ggplot2)
table(dpc$ram)
table(dpc$price)
str(dpc)
# factors
dpc$speed <- as.factor(dpc$speed)
dpc$hd <- as.factor(dpc$hd)
dpc$ram <- as.factor(dpc$ram)
dpc$screen <- as.factor(dpc$screen)
dpc$cd <- as.factor(dpc$cd)
dpc$multi <- as.factor(dpc$multi)
dpc$premium <- as.factor(dpc$premium)
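A minimal sketch of what as.factor() does to an integer column, using the first ten ram and screen values shown in the str(dpc) output in this deck. The small data frame `pcs` and the `lapply()` shortcut are illustrative additions, not part of the original script.

```r
# First ten ram and screen values, copied from the str(dpc) output.
ram    <- c(4L, 2L, 4L, 8L, 16L, 16L, 4L, 2L, 8L, 4L)
screen <- c(14L, 14L, 15L, 14L, 14L, 14L, 14L, 14L, 14L, 15L)

ram_f <- as.factor(ram)
levels(ram_f)      # "2" "4" "8" "16": distinct values, sorted numerically
as.integer(ram_f)  # 2 1 2 3 4 4 2 1 3 2: position of each value among the levels

# Converting several columns at once (hypothetical data frame `pcs`).
pcs <- data.frame(ram = ram, screen = screen)
cols <- c("ram", "screen")
pcs[cols] <- lapply(pcs[cols], as.factor)
str(pcs)
```

The integer codes match the ones str() prints for the ram factor after conversion, which is why that output shows small indices rather than the original values.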
Before the factor command:
str(dpc)
'data.frame': 6259 obs. of 11 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ price : int 1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
$ speed : int 25 33 25 25 33 66 25 50 50 50 ...
$ hd : int 80 85 170 170 340 340 170 85 210 210 ...
$ ram : int 4 2 4 8 16 16 4 2 8 4 ...
$ screen : int 14 14 15 14 14 14 14 14 14 15 ...
$ cd : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
$ multi : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
$ ads : int 94 94 94 94 94 94 94 94 94 94 ...
$ trend : int 1 1 1 1 1 1 1 1 1 1 ...
After the factor command:
str(dpc)
'data.frame': 6259 obs. of 11 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ price : int 1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
$ speed : Factor w/ 6 levels "25","33","50",..: 1 2 1 1 2 4 1 3 3 3 ...
$ hd : Factor w/ 59 levels "80","85","100",..: 1 2 9 9 24 24 9 2 11 11 ...
$ ram : Factor w/ 6 levels "2","4","8","16",..: 2 1 2 3 4 4 2 1 3 2 ...
$ screen : Factor w/ 3 levels "14","15","17": 1 1 2 1 1 1 1 1 1 2 ...
$ cd : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
$ multi : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ premium: Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
$ ads : int 94 94 94 94 94 94 94 94 94 94 ...
$ trend : int 1 1 1 1 1 1 1 1 1 1 ...
ggplot() – geom_bar()
# bar plots: geom_bar() counts the rows in each category of x
ggplot(dpc, aes(x=ram)) + geom_bar()
ggplot(dpc, aes(x=trend)) + geom_bar()
# add a theme, then axis labels and a title
ggplot(dpc, aes(x=ram)) + theme_bw() + geom_bar()
ggplot(dpc, aes(x=ram)) + theme_bw() + geom_bar() + labs(x="RAM (MB)", y="Counts", title="Computer RAM")
# fill=screen stacks the counts by screen size within each bar
ggplot(dpc, aes(x=ram, fill=screen)) + theme_bw() + geom_bar() + labs(x="RAM (MB)", y="Counts", title="Computer RAM by Screen")
# bar plot as proportions: position="fill" rescales every bar to a height of 1
ggplot(dpc, aes(x=ram, fill=screen)) + theme_bw() + geom_bar(position="fill") + labs(x="RAM (MB)", y="Proportion", title="Computer RAM by Screen")
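The effect of the position argument can be checked without the course CSV. The sketch below uses the built-in mtcars data set as a stand-in (an assumption), with cyl playing the role of ram and gear the role of screen.

```r
library(ggplot2)

# Stand-in data: convert the grouping columns to factors, as done for dpc.
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
base <- ggplot(cars, aes(x = cyl, fill = gear)) + theme_bw()

stacked <- base + geom_bar()                    # default: counts stacked per bar
filled  <- base + geom_bar(position = "fill")   # each bar rescaled to height 1
dodged  <- base + geom_bar(position = "dodge")  # groups drawn side by side

# Top of each bar after the position adjustment, one value per x category.
fill_data <- layer_data(filled)
fill_tops <- tapply(fill_data$ymax, as.integer(fill_data$x), max)
fill_tops  # every bar tops out at 1 under position = "fill"
```

With `position = "fill"` the y axis shows proportions rather than counts, so the axis label should say so.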
ggplot() – geom_bar, facet command
# side-by-side bars
ggplot(dpc, aes(x=ram, fill=screen)) + theme_bw() + geom_bar(position="dodge") + labs(x="RAM (MB)", y="Counts", title="Computer RAM by Screen")
# drill down using facets: one panel per level of premium
ggplot(dpc, aes(x=ram, fill=screen)) + theme_bw() + facet_wrap(~premium) + geom_bar(position="dodge") + labs(x="RAM (MB)", y="Counts", title="Computer RAM by Screen")
# break down by premium and cd: one panel per combination
ggplot(dpc, aes(x=ram, fill=screen)) + theme_bw() + facet_wrap(premium~cd) + geom_bar(position="dodge") + labs(x="RAM (MB)", y="Counts", title="Computer RAM by Screen")
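A hedged sketch of how the number of panels follows from the facet formula, again with mtcars standing in for the course data (am and vs play the roles of premium and cd). The `labeller = label_both` argument is an optional extra that prints the variable name in each panel strip.

```r
library(ggplot2)

cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
base <- ggplot(cars, aes(x = cyl, fill = gear)) +
  theme_bw() +
  geom_bar(position = "dodge")

# One facet variable: one panel per level of am.
one_var <- base + facet_wrap(~ am, labeller = label_both)

# Two facet variables: one panel per observed combination of am and vs.
two_var <- base + facet_wrap(am ~ vs, labeller = label_both)

# Count the panels that actually receive data.
panels_one <- length(unique(layer_data(one_var)$PANEL))
panels_two <- length(unique(layer_data(two_var)$PANEL))
c(panels_one, panels_two)
```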
ggplot(), geom_histogram()
#histogram
# with no binwidth, geom_histogram() falls back to 30 bins and prints a message asking you to pick a better value
ggplot(dpc, aes(x=price)) + theme_bw() + geom_histogram() + labs(x="Price", y="Freq", title="Computer Prices")
# try several bin widths: narrow bins show noise, wide bins hide detail
ggplot(dpc, aes(x=price)) + theme_bw() + geom_histogram(binwidth =10) + labs(x="Price", y="Freq", title="Computer Prices")
ggplot(dpc, aes(x=price)) + theme_bw() + geom_histogram(binwidth =50) + labs(x="Price", y="Freq", title="Computer Prices")
ggplot(dpc, aes(x=price)) + theme_bw() + geom_histogram(binwidth =100) + labs(x="Price", y="Freq", title="Computer Prices")
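To make the binwidth trade-off concrete, the sketch below counts how many bins each width produces. The `prices` data frame is hypothetical (evenly spaced values, not the course data) and `count_bins()` is a helper name introduced only for this example.

```r
library(ggplot2)

# Hypothetical prices, evenly spaced so the example is deterministic.
prices <- data.frame(price = seq(1000, 3000, by = 25))

# Helper (illustrative): number of bins geom_histogram() builds for a given width.
count_bins <- function(width) {
  p <- ggplot(prices, aes(x = price)) + geom_histogram(binwidth = width)
  nrow(layer_data(p))
}

bins <- sapply(c(10, 50, 100), count_bins)
names(bins) <- c("binwidth=10", "binwidth=50", "binwidth=100")
bins  # wider bins mean fewer bars over the same price range
```

Whatever the width, the bin counts still add up to the number of rows, so changing binwidth changes only how the observations are grouped.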
ggplot(), geom_histogram()
#histogram
ggplot(dpc, aes(x=price)) + theme_bw() + geom_histogram(binwidth =100) + labs(x="Price (Binwidth=100)", y="Freq", title="Histogram-Computer Prices")
ggplot(dpc, aes(x=price, fill=ram)) + theme_bw() + geom_histogram(binwidth =100) + labs(x="Price (Binwidth=100)", y="Freq", title="Histogram-Computer Prices")
ggplot(dpc, aes(x=price, fill=screen)) + theme_bw() + geom_histogram(binwidth =100) + labs(x="Price (Binwidth=100)", y="Freq", title="Histogram-Computer Prices")
ggplot(dpc, aes(x=price, fill=screen)) + theme_bw() + facet_wrap(~premium) + geom_histogram(binwidth =100) + labs(x="Price (Binwidth=100)", y="Freq", title="Histogram-Computer Prices")
ggplot() – geom_boxplot()
#Box plot
ggplot(dpc, aes(x=screen, y=price)) + theme_bw() + geom_boxplot() + labs(x="Screen", y="Price", title="Box Plot-Computer Screen vs Price")
ggplot(dpc, aes(x=screen, y=price, fill=ram)) + theme_bw() + geom_boxplot() + labs(x="Screen", y="Price", title="Box Plot-Computer Screen vs Price")
ggplot(dpc, aes(x=screen, y=price, fill=ram)) + theme_bw() + facet_wrap(~cd) + geom_boxplot() + labs(x="Screen", y="Price", title="Box Plot-Computer Screen vs Price")
ggplot(), geom_point() “scatter”
#scatter plot
ggplot(dpc, aes(x=price, y=speed)) + geom_point() + theme_bw() + labs(x="Price", y="Speed", title="Speed vs Price")
ggplot(dpc, aes(x=price, y=ram)) + theme_bw() + geom_point() + labs(x="Price", y="RAM", title="RAM vs Price")
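One point to watch in the scatter plots: speed and ram were converted to factors earlier, so ggplot2 places them on a discrete axis with equally spaced levels. If a numeric axis is wanted, the values can be recovered as sketched below. The small `pc` data frame is a hypothetical stand-in built from the first six rows of the str(dpc) output, and `speed_num` is a column name introduced for this example.

```r
library(ggplot2)

# Stand-in for dpc: first six price and speed values from the str() output.
pc <- data.frame(price = c(1499, 1795, 1595, 1849, 3295, 3695),
                 speed = as.factor(c(25, 33, 25, 25, 33, 66)))

# as.numeric() on a factor returns the level codes, not the original values.
codes <- as.numeric(pc$speed)

# Going through as.character() first recovers the original numbers.
pc$speed_num <- as.numeric(as.character(pc$speed))

p <- ggplot(pc, aes(x = price, y = speed_num)) +
  geom_point() +
  theme_bw() +
  labs(x = "Price", y = "Speed", title = "Speed vs Price")
p
```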
Questions?