Analyzing and Visualizing Data - Overview with 12 slides
School of Computer & Information Sciences
ITS530 Analyzing and Visualizing Data
Introduction: R Data Analysis
ITS530 R Data Analysis
Revision: Data Import & Entry
Importing CSV files as data frames
setwd("/Users/…./UC/ITS530/Rprog")
pcd <- read.csv("dataset_price_personal_computers.csv", na.strings = "")
Commands:
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
na.strings = ""
a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
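The effect of na.strings can be seen on a small inline example. This is a minimal sketch: the CSV text and its column names (id, name, score) are made up for illustration and are not part of the course dataset. It reads the same text twice, once with the default setting and once with na.strings = "".

```r
# Hypothetical inline CSV; the second data row has a blank name and a blank score.
csv_text <- "id,name,score
1,alice,10
2,,
3,carol,7"

# Default (na.strings = "NA"): the blank numeric field becomes NA,
# but the blank text field is kept as an empty string.
df_default <- read.csv(text = csv_text)

# na.strings = "": blank text fields are read as NA as well.
df_blank_na <- read.csv(text = csv_text, na.strings = "")

print(is.na(df_default$name))
print(is.na(df_blank_na$name))

# Count the missing values per column.
print(colSums(is.na(df_blank_na)))
```

Reading from text = instead of a file keeps the example self-contained; the same na.strings argument applies when a file name is given.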
Analyzing imported data frame: pcd
head(pcd)
  X price speed  hd ram screen cd multi premium ads trend
1 1  1499    25  80   4     14 no    no     yes  94     1
2 2  1795    33  85   2     14 no    no     yes  94     1
3 3  1595    25 170   4     15 no    no     yes  94     1
4 4  1849    25 170   8     14 no    no      no  94     1
5 5  3295    33 340  16     14 no    no     yes  94     1
6 6  3695    66 340  16     14 no    no     yes  94     1
summary(pcd)
       X            price          speed              hd               ram
 Min.   :   1   Min.   : 949   Min.   : 25.00   Min.   :  80.0   Min.   : 2.000
 1st Qu.:1566   1st Qu.:1794   1st Qu.: 33.00   1st Qu.: 214.0   1st Qu.: 4.000
 Median :3130   Median :2144   Median : 50.00   Median : 340.0   Median : 8.000
 Mean   :3130   Mean   :2220   Mean   : 52.01   Mean   : 416.6   Mean   : 8.287
 3rd Qu.:4694   3rd Qu.:2595   3rd Qu.: 66.00   3rd Qu.: 528.0   3rd Qu.: 8.000
 Max.   :6259   Max.   :5399   Max.   :100.00   Max.   :2100.0   Max.   :32.000
     screen         cd        multi      premium        ads             trend
 Min.   :14.00   no :3351   no :5386   no : 612   Min.   : 39.0   Min.   : 1.00
 1st Qu.:14.00   yes:2908   yes: 873   yes:5647   1st Qu.:162.5   1st Qu.:10.00
 Median :14.00                                    Median :246.0   Median :16.00
 Mean   :14.61                                    Mean   :221.3   Mean   :15.93
 3rd Qu.:15.00                                    3rd Qu.:275.0   3rd Qu.:21.50
 Max.   :17.00                                    Max.   :339.0   Max.   :35.00
table(pcd$cd)
  no  yes
3351 2908
tail(pcd)
        X price speed   hd ram screen  cd multi premium ads trend
6254 6254  2154    66  850  16     15 yes    no     yes  39    35
6255 6255  1690   100  528   8     15  no    no     yes  39    35
6256 6256  2223    66  850  16     15 yes   yes     yes  39    35
6257 6257  2654   100 1200  24     15 yes    no     yes  39    35
6258 6258  2195   100  850  16     15 yes    no     yes  39    35
6259 6259  2490   100  850  16     17 yes    no     yes  39    35
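The inspection commands used on pcd can be tried on any small data frame. The sketch below uses a made-up data frame named toy; its values are invented for illustration and only borrow column names from pcd. It adds dim(), str() and prop.table(), which are standard base R functions not shown on the slides.

```r
# Hypothetical toy data; values are invented, not taken from the course dataset.
toy <- data.frame(
  price = c(1200, 1450, 1700, 2300, 3100, 3600),
  ram   = c(2, 4, 4, 8, 16, 16),
  cd    = c("no", "no", "no", "no", "yes", "yes")
)

dim(toy)       # number of rows and columns
str(toy)       # type of each column
head(toy, 3)   # first three rows
tail(toy, 2)   # last two rows

# Frequency table of a categorical column, then the same as proportions.
cd_counts <- table(toy$cd)
print(cd_counts)
print(prop.table(cd_counts))
```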
Convert text to numbers & correlate
# Convert yes/no text to "1"/"0" (the results are still character strings)
> pcd$cd <- gsub("no", "0", pcd$cd)
> pcd$cd <- gsub("yes", "1", pcd$cd)
> pcd$multi <- gsub("yes", "1", pcd$multi)
> pcd$multi <- gsub("no", "0", pcd$multi)
> pcd$premium <- gsub("no", "0", pcd$premium)
> pcd$premium <- gsub("yes", "1", pcd$premium)
# Convert the "0"/"1" strings to numeric
> pcdxform <- transform(pcd, cd = as.numeric(cd), multi = as.numeric(multi), premium = as.numeric(premium))
# Compute the pairwise correlation matrix of all columns
> cor(pcdxform)
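As an alternative to the gsub() and as.numeric() steps above, the yes/no columns can be recoded in one pass with a logical comparison. This is a sketch of a different technique, not the method used on the slides; the toy data frame and the helper names yes_no_cols and toy_num are hypothetical.

```r
# Hypothetical toy data with two yes/no columns.
toy <- data.frame(
  price   = c(1500, 1800, 2400, 3300, 2100),
  cd      = c("no", "yes", "yes", "yes", "no"),
  premium = c("yes", "yes", "no", "yes", "no")
)

# Columns to recode (assumed to contain only "yes" or "no").
yes_no_cols <- c("cd", "premium")

# col == "yes" gives TRUE/FALSE; as.numeric() turns that into 1/0.
toy_num <- toy
toy_num[yes_no_cols] <- lapply(toy[yes_no_cols],
                               function(col) as.numeric(col == "yes"))

str(toy_num)

# All columns are now numeric, so cor() accepts the whole data frame.
print(cor(toy_num))
```

Comparing against "yes" avoids the intermediate character strings, and it also works when the columns were read as factors.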
Correlation
# Correlate price, speed, hd, ram
> cor(pcdxform[c("price","speed","hd","ram")])
          price     speed        hd       ram
price 1.0000000 0.3009765 0.4302578 0.6227482
speed 0.3009765 1.0000000 0.3723041 0.2347605
hd    0.4302578 0.3723041 1.0000000 0.7777263
ram   0.6227482 0.2347605 0.7777263 1.0000000
> summary(pcdxform[c("price","speed","hd","ram")])
# Create a subset of the dataframe
> pcdxform.sub0 <- subset(pcdxform, select = c("price","speed","hd","ram"))
> summary(pcdxform.sub0)
> cor(pcdxform.sub0)
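By default, cor() computes the Pearson correlation coefficient. To make the numbers in a correlation matrix less of a black box, the sketch below recomputes one coefficient by hand and compares it with cor(). The vectors x and y are made-up values, not rows of pcd.

```r
# Hypothetical paired observations.
x <- c(25, 33, 50, 66, 100)
y <- c(1500, 1800, 2100, 2600, 2500)

# Built-in Pearson correlation.
r_builtin <- cor(x, y)

# Manual computation: sum of cross-deviations divided by the
# square root of the product of the sums of squared deviations.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Equivalent form: covariance scaled by both standard deviations.
r_cov <- cov(x, y) / (sd(x) * sd(y))

print(c(builtin = r_builtin, manual = r_manual, via_cov = r_cov))
```

The three values should agree up to floating-point rounding.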
Subset based on criteria
> pcdxform.sub1 <- subset(pcdxform, price >= 2500, select = c("price","speed","hd","ram"))
> View(pcdxform.sub1)
> cor(pcdxform.sub1)
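The same row-and-column selection can be written with bracket indexing, which may help clarify what subset() does. A minimal sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical toy data.
toy <- data.frame(
  price = c(1500, 2500, 3300, 2100, 2900),
  speed = c(25, 33, 66, 50, 100),
  ram   = c(4, 8, 16, 8, 16),
  ads   = c(94, 94, 120, 200, 39)
)

# subset(): the row condition comes first, the columns go in select.
sub_a <- subset(toy, price >= 2500, select = c("price", "speed", "ram"))

# Bracket indexing: a logical row filter, then a vector of column names.
sub_b <- toy[toy$price >= 2500, c("price", "speed", "ram")]

print(sub_a)
print(nrow(sub_a))

# Check that both approaches select the same rows and columns.
print(all.equal(sub_a, sub_b))
```

One difference to keep in mind: subset() drops rows where the condition evaluates to NA, while bracket indexing with a logical vector returns a row of NAs for them.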
Questions?