Data Science & Big Data Analy

ITS836_02a_RIntro_student4.pdf

Copyright © 2014 EMC Corporation. All rights reserved.


Review of Basic Data Analytic Methods Using R

Module 3: Basic Data Analytic Methods Using R

Introduction

Upon completion of this week, you should be able to:

• Use basic analytics methods such as distributions, statistical tests and summary operations to investigate a data set.

• Use R as a tool to perform basic data analytics, reporting and basic data visualization.


Putting the Data Analytics Lifecycle into Practice

• From Lesson 1 you learned a strategy to approach any data analytics problem:

  - Phase 1: Discovery
  - Phase 2: Data Preparation
  - Phase 3: Model Planning (covered in Module 4)
  - Phase 4: Model Building
  - Phase 5: Communicate Results
  - Phase 6: Operationalize

• To begin to analyze the data you need:
  1. A tool that allows you to look at the data – that is “R”.
  2. Skill in basic statistics – we’re providing a refresher.


Using R to Look at Data – Introduction to R

During this lesson the following topics are covered:
• Using the R Graphical User Interface
• Overview: Getting Data into (and out of) R
• Data Types Used in R
• Basic R Operations
• Basic Statistics
• Generic Functions

GETTING A HANDLE ON THE DATA


Five Things to Remember About R

1. (Almost) everything is an object

2. (Almost) everything is a vector
  - Example: x <- 3               # x is a vector of length 1
  - Example: v <- c(2,4,6,8,10)   # v is a vector of length 5

3. All commands are functions
  - Example: quit() or q(), not q

4. Some commands produce different output depending…

5. Know your default arguments!
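
Points 2, 3, and 5 can be tried at the console. The following is a minimal sketch; the variable y and the choice of mean() to illustrate default arguments are assumptions made for this example, not part of the original slide.

```r
# Point 2: a single number is a vector of length 1
x <- 3
length(x)               # 1
v <- c(2, 4, 6, 8, 10)
length(v)               # 5

# Point 3: commands are functions, so they need parentheses to run.
# Typing q alone prints the function definition instead of quitting.
is.function(q)          # TRUE

# Point 5: default arguments change results silently
y <- c(1, NA, 3)        # illustrative values
mean(y)                 # NA, because na.rm defaults to FALSE
mean(y, na.rm = TRUE)   # 2
```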


Using the RStudio Graphical User Interface

[Figure: RStudio window showing the Script, Workspace, Plot, and Console panes]


Where to get R and RStudio

• First, download and install R
  - For Windows, go to http://cran.r-project.org/bin/windows/base/
  - For Linux, go to http://cran.us.r-project.org/bin/linux/
  - For Mac OS X, go to http://cran.us.r-project.org/bin/macosx/

• Then, download and install RStudio
  - http://www.rstudio.com/ide/download/


Overview: Getting Data Into (and Out of) R

• Getting Data Into R
  - Type it in (if it’s small)!
  - Read from a data file
  - Read from a database

• Getting Data Out of R
  - Save in a workspace
  - Write a text file
  - Save an object to the file system
  - You can save plots as well!


Typing Data Into R
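
One possible way to type a small data set in directly is with c() and data.frame(). This is only a sketch; the column names and values are invented for illustration.

```r
# Enter each column as a vector with c()
id    <- c(1, 2, 3)                 # illustrative values
label <- c("a", "b", "c")           # illustrative values

# Combine the vectors into a data frame, one column per vector
d <- data.frame(id = id, label = label)

nrow(d)    # 3
names(d)   # "id" "label"
```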


Getting Data Into R: External Sources

• R supports multiple file formats
  - read.table() is the main function

• File name can be a URL
  - read.table("http://ahost/file.csv", sep=",", header=TRUE) is equivalent to read.csv(…) for a plain comma-separated file with a header row; read.csv() sets sep="," and header=TRUE by default

• Can read directly from a database via ODBC interface (RODBC package)
  - mydb <- odbcConnect("MyPostgresDB", …)

• R packages exist to read data from Hadoop or HDFS (more later)

Note! Use the forward-slash ("/") character in full file names, even on Windows: "C:/users/janedoe/My Documents/Script.R". A backslash inside an R string must be doubled ("\\").
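
The relationship between read.table() and read.csv() can be checked locally. The sketch below assumes a tiny comma-separated file written to a temporary path (the file contents are invented) rather than a real URL.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,4.0", "3,2.5"), path)

# read.table() needs the separator and the header row stated explicitly
a <- read.table(path, sep = ",", header = TRUE)

# read.csv() applies sep = "," and header = TRUE by default
b <- read.csv(path)

all.equal(a, b)   # TRUE
```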


Getting Data Out of R

Options and R Code

Save it as part of your workspace (or a different workspace):
    save.image(file="dfm.Rdata")
    save.image()          # writes the default .RData file
    load("dfm.Rdata")

Save it as a data file:
    write.csv(dfm, file="dfm.csv")

Save it as an R object:
    save(Mydata, file="Mydata.Rdata")
    load(file="Mydata.Rdata")

Plots can be saved as images:
    savePlot(filename="filename.ext", type="type")
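
A round trip through these options might look like the sketch below. The data frame is invented, files go to a temporary directory, and the plot step swaps in the pdf() file device because saving from an on-screen plot window needs an interactive session.

```r
dfm <- data.frame(id = 1:3, score = c(3.5, 4.0, 2.5))  # illustrative data
out <- tempdir()

# Data file
csv_path <- file.path(out, "dfm.csv")
write.csv(dfm, file = csv_path, row.names = FALSE)

# R object: save() stores it under its name, load() restores that name
rdata_path <- file.path(out, "dfm.Rdata")
save(dfm, file = rdata_path)
rm(dfm)
load(file = rdata_path)

# Plot: open a file device, draw, then close the device
pdf_path <- file.path(out, "scores.pdf")
pdf(pdf_path)
plot(dfm$id, dfm$score)
dev.off()

file.exists(c(csv_path, rdata_path, pdf_path))
```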


Data Classification: A Quick Review

Data “NOIR” and Examples

Nominal: condo, house, rental
Ordinal: hates < dislikes < neutral < likes < loves
Interval: 10 °F colder tomorrow than today
Ratio: 5342 > 4321
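
As a rough guide to how these categories might be represented in R, nominal data maps to a factor, ordinal data to an ordered factor, and interval or ratio data to numeric vectors. The sketch below assumes made-up values throughout.

```r
# Nominal: categories with no order
home <- factor(c("condo", "house", "rental", "house"))

# Ordinal: categories with an order, so comparisons are meaningful
opinion <- factor(c("likes", "hates", "neutral"),
                  levels = c("hates", "dislikes", "neutral", "likes", "loves"),
                  ordered = TRUE)
opinion[1] > opinion[2]   # TRUE: likes ranks above hates

# Interval and ratio data are usually stored as plain numeric vectors
temps_f <- c(62, 52)      # illustrative temperatures, today and tomorrow
diff(temps_f)             # -10
```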


Data Types Used in R

Data Types and R Code

Numbers, Strings:
    n <- 3
    s <- "columbus, ohio"

Vectors:
    levels <- c("Wow", "Good", "Bad")
    ratings <- c("Bad", "Bad", "Wow")

Factors and Lists:
    f <- factor(ratings, levels)
    l <- list(ratings=ratings, critics=c("Siskel", "Ebert"))

Functions:
    stdev <- function(x) {sd(x)}
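
To see what these objects hold, the definitions can be run and inspected as in the sketch below. The numeric input passed to stdev() is an assumption chosen for illustration.

```r
levels  <- c("Wow", "Good", "Bad")
ratings <- c("Bad", "Bad", "Wow")

# A factor stores each value as a position in the level set
f <- factor(ratings, levels)
as.integer(f)     # 3 3 1
table(f)          # counts per level; "Good" appears with a count of 0

# A list can hold components of different lengths and types
l <- list(ratings = ratings, critics = c("Siskel", "Ebert"))
l$critics[2]      # "Ebert"
length(l)         # 2

# A user-defined function that wraps sd()
stdev <- function(x) { sd(x) }
stdev(c(2, 4, 4, 4, 5, 5, 7, 9))   # sample standard deviation
```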


R Structured Types

Data Types and R Code

Matrix (n × m array whose elements all have the same type):
    m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)

Table (contingency table):
    t <- table(dfm$factor_variable)

Data frames (data sets):
    dfm <- read.csv("CrimeRatesByStates2005.csv")

Extracting data:
    xdfm <- dfm[1:3, ]
    ydfm <- dfm[, 3:5]
    s <- dfm$state
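
The same operations can be tried without the CSV file. In the sketch below a small hypothetical data frame stands in for it; its column names and values are assumptions, not the contents of CrimeRatesByStates2005.csv.

```r
m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
m[2, 1]        # 11: byrow = TRUE fills the first row with 1:3, then 11:13
dim(m)         # 2 3

# Hypothetical data frame used in place of the CSV file
dfm <- data.frame(
  state  = c("A", "B", "C", "D"),
  region = factor(c("east", "west", "east", "east")),
  x1 = 1:4, x2 = 5:8, x3 = 9:12
)

t <- table(dfm$region)   # east: 3, west: 1
xdfm <- dfm[1:3, ]       # first three rows, all columns
ydfm <- dfm[, 3:5]       # all rows, columns 3 to 5
s <- dfm$state           # one column, returned as a vector
```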


Basic R Operations on Vectors

Function and R Code

Operations on Vectors:
    v <- c(1:10); w <- c(15:24); nv <- v * pi; nw <- w * v

Vector transformations:
    radius <- sqrt(d$area / pi)
    t <- table(dfm$factor_variable)
    pct <- t / sum(t) * 100

Logical Vectors:
    v[v < 1000]
    ndf <- subset(dfm, population < 10000)
    nv <- v[c(1,2,3,5,8,13)]    # positions past length(v) return NA

Examining data structures:
    dim(dfm); attributes(dfm); class(dfm); typeof(dfm)
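
The vector operations can be checked with small inputs. This sketch assumes invented values for the area and category vectors.

```r
v <- c(1:10); w <- c(15:24)
nw <- w * v               # element-wise product: 15*1, 16*2, ..., 24*10
nw[1:3]                   # 15 32 51

# Radius of a circle from its area: r = sqrt(area / pi)
area <- c(pi, 4 * pi)     # illustrative values
sqrt(area / pi)           # 1 2

# Percentages from a contingency table
t <- table(c("a", "b", "a", "a"))
pct <- t / sum(t) * 100   # a: 75, b: 25

# A logical vector selects the elements where the condition is TRUE
v[v > 7]                  # 8 9 10

# Indexing past the end of a vector yields NA rather than an error
v[c(1, 2, 13)]            # 1 2 NA
```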


Descriptive Statistics

Function and R Code

View the data:
    head(x); tail(x)

View a summary of the data:
    summary(x)

Compute basic statistics:
    sd(x); var(x); range(x); IQR(x)

Correlation:
    cor(x); cor(d$var1, d$var2)
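
A short session with these functions might look like the following sketch. The data frame d is invented, with var2 set to exactly twice var1 so the correlation is easy to predict.

```r
d <- data.frame(var1 = c(1, 2, 3, 4, 5),
                var2 = c(2, 4, 6, 8, 10))   # illustrative: var2 = 2 * var1

head(d, 2)            # first two rows
summary(d$var1)       # min, quartiles, mean, max
sd(d$var1)            # sample standard deviation
var(d$var1)           # 2.5
range(d$var1)         # 1 5
IQR(d$var1)           # 2
cor(d$var1, d$var2)   # 1, since var2 is an exact linear function of var1
cor(d)                # 2 x 2 correlation matrix of the numeric columns
```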


Generic Functions

• Comparable to method overriding in OO-land: one function name, with the method chosen by the class of its argument

• Specific actions that differ based on the class of the object:

Function and R Code

Plot the variable x:
    plot(x)

Histogram of x:
    hist(x)

Internal structure of x:
    str(x)

• Good for initial data exploration (more later)
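
Dispatch on class can be observed with summary(), another generic function. In this sketch the class name "rating" and its summary method are hypothetical, defined only to show how a generic picks a method.

```r
x_num <- c(1, 2, 3, 4)
x_fac <- factor(c("a", "b", "a"))

summary(x_num)   # numeric summary: min, quartiles, mean, max
summary(x_fac)   # counts per level: a 2, b 1

# A method for a hypothetical class "rating"; the class name is made up
summary.rating <- function(object, ...) {
  paste("rating with", length(unclass(object)), "values")
}
r <- structure(c(5, 3, 4), class = "rating")
summary(r)       # "rating with 3 values"
```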


Check Your Knowledge

• Which data structures in R are the most used? Why?
• Consider the cbind() function and the rbind() function that bind a vector to a data frame as a new column or a new row. When might these functions be useful?
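
As a starting point for the second question, here is a minimal sketch of both functions on an invented data frame. The column names and the pass threshold are assumptions.

```r
dfm <- data.frame(id = 1:3, score = c(3.5, 4.0, 2.5))  # illustrative data

# cbind() adds a vector as a new column (one value per existing row)
dfm2 <- cbind(dfm, passed = dfm$score >= 3)

# rbind() adds a new row; a one-row data frame keeps column types intact
dfm3 <- rbind(dfm, data.frame(id = 4L, score = 5.0))

dim(dfm2)   # 3 3
dim(dfm3)   # 4 2
```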


Summary

During this lesson the following topics were covered:
• How to use the R Graphical User Interface
• How to get data into (and out of) R
• Data Types used in R, and the basic R operations
• Basic descriptive statistics
• Using generic functions

This slide contains the key points covered in this lesson. Please take a moment to review them.