Data Science & Big Data Analy
Copyright © 2014 EMC Corporation. All rights reserved.
Review of Basic Data Analytic Methods Using R
Module 3: Basic Data Analytic Methods Using R
Introduction
Upon completion of this week, you should be able to:
• Use basic analytics methods such as distributions, statistical tests and summary operations to investigate a data set.
• Use R as a tool to perform basic data analytics, reporting and basic data visualization.
Putting the Data Analytics Lifecycle into Practice
• From Lesson 1 you learned a strategy to approach any data analytics problem:
  - Phase 1: Discovery
  - Phase 2: Data Preparation
  - Phase 3: Model Planning (covered in Module 4)
  - Phase 4: Model Building
  - Phase 5: Communicate Results
  - Phase 6: Operationalize
• To begin to analyze the data you need:
  1. A tool that allows you to look at the data – that is “R”.
  2. Skill in basic statistics – we’re providing a refresher.
Review of Basic Data Analytic Methods Using R
Using R to Look at Data – Introduction to R
During this lesson the following topics are covered:
• Using the R Graphical User Interface
• Overview: Getting Data into (and out of) R
• Data Types Used in R
• Basic R Operations
• Basic Statistics
• Generic Functions
GETTING A HANDLE ON THE DATA
Five Things to Remember About R
1. (Almost) everything is an object
2. (Almost) everything is a vector
   Example: x <- 3                   # x is a vector of length 1
            v <- c(2, 4, 6, 8, 10)   # v is a vector of length 5
3. All commands are functions
   Example: quit() or q(), not q
4. Some commands produce different output depending…
5. Know your default arguments!
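The five points can be tried out at the console. The sketch below is only an illustration: the use of summary() for point 4 is one reading of that item (it matches the Generic Functions slide later in this lesson), and mean() is just one convenient function whose default argument matters.

```r
# Points 1 and 2: a single number is an object, and it is a vector of length 1
x <- 3
is.vector(x)   # TRUE
length(x)      # 1
v <- c(2, 4, 6, 8, 10)
length(v)      # 5

# Point 3: q is itself an object of class "function";
# typing q() would call it and end the session
class(q)       # "function"

# Point 4 (one reading): summary() reports different things
# for a numeric vector and for a factor
summary(v)
summary(factor(c("a", "b", "a")))

# Point 5: mean() defaults to na.rm = FALSE, so one missing value gives NA
mean(c(1, NA, 3))                 # NA
mean(c(1, NA, 3), na.rm = TRUE)   # 2
```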
Using the RStudio Graphical User Interface
[Figure: the RStudio window, with its panes labeled Script, Workspace, Plot, and Console]
Where to get R and RStudio
• First, download and install R
  - For Windows, go to http://cran.r-project.org/bin/windows/base/
  - For Linux, go to http://cran.us.r-project.org/bin/linux/
  - For Mac OS X, go to http://cran.us.r-project.org/bin/macosx/
• Then, download and install RStudio
  - http://www.rstudio.com/ide/download/
Overview: Getting Data Into (and Out of) R
• Getting Data Into R
  - Type it in (if it’s small)!
  - Read from a data file
  - Read from a database
• Getting Data Out of R
  - Save in a workspace
  - Write a text file
  - Save an object to the file system
  - You can save plots as well!
Typing Data Into R
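A minimal sketch of typing a small data set directly into R. The variable names and all values here are hypothetical and chosen only for illustration.

```r
# Hypothetical values, typed in by hand
state      <- c("Ohio", "Texas", "Maine")
population <- c(11.5, 25.1, 1.3)   # made-up figures, in millions

# Combine the vectors into a data frame, one column per vector
dfm <- data.frame(state = state, population = population)
dfm
nrow(dfm)   # 3
```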
Getting Data Into R: External Sources
• R supports multiple file formats
  - read.table() is the main function
• File name can be a URL
  - read.table("http://ahost/file.csv", sep=",") is close to read.csv(…), which additionally defaults to header=TRUE
• Can read directly from a database via ODBC interface (RODBC package)
  - mydb <- odbcConnect("MyPostgresDB", …)
• R packages exist to read data from Hadoop or HDFS (more later)
Note! Use the forward-slash ("/") character in full file names, even on Windows: "C:/users/janedoe/My Documents/Script.R". A backslash only works if it is doubled ("\\") inside an R string.
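The relationship between read.table() and read.csv() can be checked on a small file. This is a sketch under stated assumptions: the file is written to a temporary location, and its column names and values are made up for the example.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,rate_a,rate_b",
             "Ohio,5.1,872.8",
             "Texas,6.2,961.6"), csv_path)

# read.table() needs the separator and the header flag spelled out ...
a <- read.table(csv_path, sep = ",", header = TRUE)
# ... while read.csv() sets both by default
b <- read.csv(csv_path)

all.equal(a, b)   # TRUE
```

Without header = TRUE, read.table() would treat the first line as data and name the columns V1, V2, V3.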
Getting Data Out of R
Options                                R Code
Save it as part of your workspace      save.image(file="dfm.Rdata")
(or a different workspace)             save.image()   # default file is .RData
                                       load("dfm.Rdata")
Save it as a data file                 write.csv(dfm, file="dfm.csv")
Save it as an R object                 save(Mydata, file="Mydata.Rdata")
                                       load(file="Mydata.Rdata")
Plots can be saved as images           savePlot(filename="filename.ext", type="type")
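A round trip through these options might look like the sketch below. The data frame is hypothetical and all files go to a temporary directory. For the plot, the sketch swaps in the pdf() file device rather than the interactive plot-saving call, since a file device also works in a non-interactive session; png() and jpeg() follow the same open, draw, close pattern.

```r
dfm <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))   # hypothetical data
out_dir <- tempdir()

# Data file
csv_file <- file.path(out_dir, "dfm.csv")
write.csv(dfm, file = csv_file, row.names = FALSE)

# Single R object: save() writes it, load() restores it under the same name
rdata_file <- file.path(out_dir, "dfm.Rdata")
save(dfm, file = rdata_file)
rm(dfm)
load(rdata_file)

# Plot: open a file device, draw, then close the device
plot_file <- file.path(out_dir, "dfm.pdf")
pdf(plot_file)
plot(dfm$x, dfm$y)
dev.off()

file.exists(c(csv_file, rdata_file, plot_file))
```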
Data Classification: A Quick Review
Data “Noir”    Examples
Nominal        condo, house, rental
Ordinal        hates < dislikes < neutral < likes < loves
Interval       10°F colder tomorrow than today
Ratio          5342 > 4321
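One way these four levels could be represented in R is sketched below. Mapping nominal data to a plain factor and ordinal data to an ordered factor is a common convention rather than something these slides prescribe, and the temperature values are hypothetical.

```r
# Nominal: categories with no order
housing <- factor(c("condo", "house", "rental", "house"))
table(housing)

# Ordinal: categories with an order, so comparisons are meaningful
opinion_levels <- c("hates", "dislikes", "neutral", "likes", "loves")
opinion <- factor(c("likes", "hates", "neutral"),
                  levels = opinion_levels, ordered = TRUE)
opinion[1] > opinion[2]   # TRUE: "likes" ranks above "hates"

# Interval: differences are meaningful (hypothetical temperatures, degrees F)
today    <- 60
tomorrow <- 50
today - tomorrow          # 10

# Ratio: a true zero exists, so ratios are meaningful as well
5342 / 4321
```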
Data Types Used in R
Data Types           R Code
Numbers, Strings     n <- 3
                     s <- "columbus, ohio"
Vectors              levels <- c("Wow", "Good", "Bad")
                     ratings <- c("Bad", "Bad", "Wow")
Factors and Lists    f <- factor(ratings, levels)
                     l <- list(ratings=ratings, critics=c("Siskel", "Ebert"))
Functions            stdev <- function(x) {sd(x)}
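Running the slide's definitions and inspecting the results shows how the types behave. The input to stdev() at the end is a hypothetical vector chosen so the result is easy to check by hand.

```r
levels  <- c("Wow", "Good", "Bad")
ratings <- c("Bad", "Bad", "Wow")

# A factor stores each value as an integer code into its levels
f <- factor(ratings, levels)
table(f)        # counts per level, including the unused "Good"
as.integer(f)   # 3 3 1

# A list can hold components of different lengths, accessed by name
l <- list(ratings = ratings, critics = c("Siskel", "Ebert"))
l$critics[2]    # "Ebert"

# A user-defined function that wraps sd()
stdev <- function(x) { sd(x) }
stdev(c(2, 4, 6))   # 2
```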
R Structured Types
Data Types                             R Code
Matrix – n×m array of one data type    m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
Table – contingency table              t <- table(dfm$factor_variable)
Data frames – data sets                dfm <- read.csv("CrimeRatesByStates2005.csv")
Extracting data                        xdfm <- dfm[1:3, ]
                                       ydfm <- dfm[, 3:5]
                                       s <- dfm$state
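The structured types can be explored without the CSV file by building a small stand-in data frame. Everything in dfm below (column names and values) is made up for the sketch and is not the content of CrimeRatesByStates2005.csv.

```r
# byrow = TRUE fills the matrix one row at a time
m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
m[2, 1]   # 11

# Made-up stand-in for the data set read from CSV
dfm <- data.frame(
  state  = c("Ohio", "Texas", "Maine", "Utah"),
  region = factor(c("Midwest", "South", "Northeast", "West")),
  v1     = c(1.0, 2.0, 3.0, 4.0),
  v2     = c(10, 20, 30, 40),
  v3     = c(100, 200, 300, 400)
)

xdfm <- dfm[1:3, ]         # first three rows, all columns
ydfm <- dfm[, 3:5]         # all rows, columns 3 through 5
s    <- dfm$state          # one column, by name
t    <- table(dfm$region)  # contingency table of a factor column

dim(xdfm)   # 3 5
dim(ydfm)   # 4 3
```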
Basic R Operations on Vectors
Function                     R Code
Operations on Vectors        v <- c(1:10); w <- c(15:24)
                             nv <- v * pi; nw <- w * v
Vector transformations       radius <- sqrt(d$area / pi)
                             t <- table(dfm$factor_variable)
                             pct <- t / sum(t) * 100
Logical Vectors              v[v < 1000]
                             ndf <- subset(dfm, population < 10000)
                             nv <- v[c(1,2,3,5,8,13)]
Examining data structures    dim(dfm); attributes(dfm); class(dfm); typeof(dfm)
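These operations can be checked on small inputs. In the sketch below, d is a hypothetical data frame with area and population columns, with values chosen so the results are easy to verify by hand.

```r
v <- c(1:10); w <- c(15:24)
nv <- v * pi    # every element multiplied by a scalar
nw <- w * v     # element-wise product of two vectors
nw[1:2]         # 15 32

# Hypothetical data frame
d <- data.frame(area = c(pi, 4 * pi), population = c(500, 20000))
radius <- sqrt(d$area / pi)   # 1 2

# Percentages from a contingency table
t   <- table(c("a", "b", "a", "a"))
pct <- t / sum(t) * 100       # a: 75, b: 25

# A logical vector selects the elements where the condition is TRUE
v[v < 5]                      # 1 2 3 4
small <- subset(d, population < 10000)

dim(d); class(d); typeof(d)   # 2 2; "data.frame"; "list"
```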
Descriptive Statistics
Function                      R Code
View the data                 head(x); tail(x)
View a summary of the data    summary(x)
Compute basic statistics      sd(x); var(x); range(x); IQR(x)
Correlation                   cor(x); cor(d$var1, d$var2)
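A quick way to try these functions is on mtcars, a data set that ships with R. Using mtcars is an assumption of this sketch; the slides do not reference it.

```r
x <- mtcars$mpg

head(mtcars)       # first six rows
tail(mtcars, 3)    # last three rows
summary(x)         # min, quartiles, median, mean, max

sd(x); var(x)      # var(x) equals sd(x)^2
range(x); IQR(x)

# Correlation between two columns: heavier cars get fewer miles per gallon
cor(mtcars$mpg, mtcars$wt)

# cor() on a data frame of numeric columns returns a correlation matrix
cor(mtcars[, c("mpg", "wt", "hp")])
```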
Generic Functions
• Also known as method overriding in OO-land
• Good for initial data exploration (more later)
• Specific actions that differ based on the class of the object:

Function                   R Code
Plot the variable x        plot(x)
Histogram of x             hist(x)
Internal structure of x    str(x)
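The class-dependent behavior can be seen by calling the same generic on objects of different classes. The sketch below uses summary() and str() rather than the plotting functions so that it runs without a graphics device. The "rating" class and its print method are hypothetical, added only to show how a generic picks a method by class.

```r
x <- c(3.1, 4.7, 5.0, 6.2)
f <- factor(c("Bad", "Wow", "Bad"))

summary(x)   # numeric vector: min, quartiles, median, mean, max
summary(f)   # factor: a count per level
str(x)
str(f)

# A hypothetical class with its own print method
r <- structure(list(score = 4), class = "rating")
print.rating <- function(x, ...) {
  cat("Rating:", x$score, "out of 5\n")
  invisible(x)
}
print(r)     # dispatches to print.rating
```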
Check Your Knowledge
• Which data structures in R are the most used? Why?
• Consider the cbind() function and the rbind() function that bind a vector to a data frame as a new column or a new row. When might these functions be useful?
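For reference when working on the second question, a minimal sketch of what the two functions do to a small, hypothetical, all-numeric data frame:

```r
df <- data.frame(a = 1:3, b = 4:6)   # hypothetical data

# cbind() adds a vector as a new column (length must match the row count)
df_col <- cbind(df, c = c(7, 8, 9))
dim(df_col)   # 3 3

# rbind() adds a vector as a new row (one value per column, in order)
df_row <- rbind(df, c(10, 11))
dim(df_row)   # 4 2
```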
Review of Basic Data Analytic Methods Using R
Summary
During this lesson the following topics were covered:
• How to use the R Graphical User Interface
• How to get data into (and out of) R
• Data Types used in R, and the basic R operations
• Basic descriptive statistics
• Using generic functions
This slide contains the key points covered in this lesson. Please take a moment to review them.