Analyzing and visualizing Data - Overview with 12 slides

ITS530RDataAnalysis1.pptx

School of Computer & Information Sciences

ITS530 Analyzing and Visualizing Data

Introduction: R Data Analysis

ITS530 R Data Analysis


Revision: Data Import & Entry


Importing CSV files as data frames

setwd("/Users/…./UC/ITS530/Rprog")

pcd <- read.csv("dataset_price_personal_computers.csv", na.strings = "")

Commands:

read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

na.strings = ""

a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
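
As a minimal sketch of how na.strings behaves, the following reads a small inline CSV through the text argument of read.csv, so no file is needed. The values and the names csv_text, toy and na_counts are made up for illustration and are not the course dataset.

```r
# Hypothetical inline CSV standing in for a real file; the values are made up.
csv_text <- "price,speed,cd
1499,25,no
,33,yes
1595,25,"

# na.strings = "" turns empty text fields into NA as well.
# stringsAsFactors = FALSE keeps text columns as character in every R version.
toy <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Checking the NA counts right after an import is a quick way to confirm that blank fields were read as missing values rather than as empty strings.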


Analyzing imported data frame: pcd

head(pcd)

  X price speed  hd ram screen cd multi premium ads trend
1 1  1499    25  80   4     14 no    no     yes  94     1
2 2  1795    33  85   2     14 no    no     yes  94     1
3 3  1595    25 170   4     15 no    no     yes  94     1
4 4  1849    25 170   8     14 no    no      no  94     1
5 5  3295    33 340  16     14 no    no     yes  94     1
6 6  3695    66 340  16     14 no    no     yes  94     1

summary(pcd)

       X            price          speed              hd              ram
 Min.   :   1   Min.   : 949   Min.   : 25.00   Min.   :  80.0   Min.   : 2.000
 1st Qu.:1566   1st Qu.:1794   1st Qu.: 33.00   1st Qu.: 214.0   1st Qu.: 4.000
 Median :3130   Median :2144   Median : 50.00   Median : 340.0   Median : 8.000
 Mean   :3130   Mean   :2220   Mean   : 52.01   Mean   : 416.6   Mean   : 8.287
 3rd Qu.:4694   3rd Qu.:2595   3rd Qu.: 66.00   3rd Qu.: 528.0   3rd Qu.: 8.000
 Max.   :6259   Max.   :5399   Max.   :100.00   Max.   :2100.0   Max.   :32.000

     screen        cd         multi      premium         ads            trend
 Min.   :14.00   no :3351   no :5386   no : 612   Min.   : 39.0   Min.   : 1.00
 1st Qu.:14.00   yes:2908   yes: 873   yes:5647   1st Qu.:162.5   1st Qu.:10.00
 Median :14.00                                    Median :246.0   Median :16.00
 Mean   :14.61                                    Mean   :221.3   Mean   :15.93
 3rd Qu.:15.00                                    3rd Qu.:275.0   3rd Qu.:21.50
 Max.   :17.00                                    Max.   :339.0   Max.   :35.00


table(pcd$cd)

  no  yes
3351 2908
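
The counts from table(pcd$cd) can also be read as proportions. A small sketch, which re-enters the two printed counts by hand so that it runs without the dataset (the names cd_counts and cd_share are illustrative only):

```r
# Counts copied from the table(pcd$cd) output on the slide.
cd_counts <- c(no = 3351, yes = 2908)

# prop.table() divides each count by the total, giving shares that sum to 1.
cd_share <- prop.table(cd_counts)
print(round(cd_share, 3))
```

With the real data frame loaded, prop.table(table(pcd$cd)) should give the same shares directly.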

tail(pcd)

        X price speed   hd ram screen  cd multi premium ads trend
6254 6254  2154    66  850  16     15 yes    no     yes  39    35
6255 6255  1690   100  528   8     15  no    no     yes  39    35
6256 6256  2223    66  850  16     15 yes   yes     yes  39    35
6257 6257  2654   100 1200  24     15 yes    no     yes  39    35
6258 6258  2195   100  850  16     15 yes    no     yes  39    35
6259 6259  2490   100  850  16     17 yes    no     yes  39    35

Convert text to numbers & correlate

# Replace the yes/no text in cd, multi and premium with the strings "1"/"0"

> pcd$cd <- gsub("no", "0", pcd$cd)
> pcd$cd <- gsub("yes", "1", pcd$cd)
> pcd$multi <- gsub("yes", "1", pcd$multi)
> pcd$multi <- gsub("no", "0", pcd$multi)
> pcd$premium <- gsub("no", "0", pcd$premium)
> pcd$premium <- gsub("yes", "1", pcd$premium)

# Convert the "0"/"1" strings to numeric values in a new data frame

> pcdxform <- transform(pcd, cd = as.numeric(cd), multi = as.numeric(multi), premium = as.numeric(premium))

# Compute the pairwise correlation matrix of all columns (Pearson by default)

> cor(pcdxform)
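
The same yes/no conversion can be sketched in one step by comparing each column with "yes". This is an alternative to the gsub approach on the slide, not the method used there; the toy data frame and the name yes_no_cols are made up for illustration.

```r
# Toy data frame with made-up values; the column names mirror the slides.
toy <- data.frame(
  price   = c(1499, 1795, 3295, 3695),
  cd      = c("no", "no", "yes", "yes"),
  premium = c("yes", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# x == "yes" gives TRUE/FALSE; as.numeric() maps TRUE to 1 and FALSE to 0.
yes_no_cols <- c("cd", "premium")
toy[yes_no_cols] <- lapply(toy[yes_no_cols], function(x) as.numeric(x == "yes"))

# All columns are now numeric, so cor() accepts the whole data frame.
str(toy)
print(cor(toy))
```

One caveat of this variant: any value other than "yes" (including a misspelling) becomes 0, while NA stays NA, so it is worth running table() on the column first.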


Correlation

# Correlate price, speed, hd, ram

> cor(pcdxform[c("price","speed","hd","ram")])

          price     speed        hd       ram
price 1.0000000 0.3009765 0.4302578 0.6227482
speed 0.3009765 1.0000000 0.3723041 0.2347605
hd    0.4302578 0.3723041 1.0000000 0.7777263
ram   0.6227482 0.2347605 0.7777263 1.0000000

> summary(pcdxform[c("price","speed","hd","ram")])

# Create a subset of the dataframe

> pcdxform.sub0 <- subset(pcdxform, select = c("price","speed","hd","ram"))

> summary(pcdxform.sub0)

> cor(pcdxform.sub0)
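
To make the numbers returned by cor() less of a black box, the following sketch computes a Pearson correlation by hand and compares it with cor(). It uses only the ram and price values of the six rows printed by head(pcd), so the result describes those six rows and not the full dataset; r_manual and r_builtin are illustrative names.

```r
# ram and price from the six rows shown by head(pcd) on the earlier slide.
ram   <- c(4, 2, 4, 8, 16, 16)
price <- c(1499, 1795, 1595, 1849, 3295, 3695)

# Pearson r: sum of products of deviations from the means, divided by the
# square root of the product of the two sums of squared deviations.
dev_ram   <- ram - mean(ram)
dev_price <- price - mean(price)
r_manual  <- sum(dev_ram * dev_price) /
  sqrt(sum(dev_ram^2) * sum(dev_price^2))

# cor() uses method = "pearson" by default, so the two values should agree.
r_builtin <- cor(ram, price)
print(c(manual = r_manual, builtin = r_builtin))
```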


Subset based on criteria

> pcdxform.sub1 <- subset(pcdxform, price >= 2500, select = c("price","speed","hd","ram"))

> View(pcdxform.sub1)

> cor(pcdxform.sub1)
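
For comparison, the same kind of row and column selection can be sketched with bracket indexing instead of subset(). The toy data frame holds made-up values, and toy, keep_cols, by_subset and by_index are illustrative names.

```r
# Toy data frame with made-up values; the column names mirror the slides.
toy <- data.frame(
  price  = c(1499, 2595, 3295, 2144),
  speed  = c(25, 66, 33, 50),
  hd     = c(80, 528, 340, 340),
  ram    = c(4, 8, 16, 8),
  screen = c(14, 15, 14, 14)
)
keep_cols <- c("price", "speed", "hd", "ram")

# subset(): the row condition is evaluated inside the data frame.
by_subset <- subset(toy, price >= 2500, select = keep_cols)

# Bracket indexing: logical row filter, then column names.
by_index <- toy[toy$price >= 2500, keep_cols]

print(by_subset)
print(all.equal(by_subset, by_index))
```

The two forms differ when the condition yields NA: subset() drops those rows, while bracket indexing returns them as rows filled with NA.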


Questions?
