Data Science and R Language

Copyright © 2014 EMC Corporation. All rights reserved.

Review of Basic Data Analytic Methods Using R

Module 3: Basic Data Analytic Methods Using R

Introduction

Upon completion of this week, you should be able to:

• Use basic analytics methods such as distributions, statistical tests and summary operations to investigate a data set.

• Use R as a tool to perform basic data analytics, reporting and basic data visualization.

These are the objectives for this week.

Specifically, after completing this module, you should be able to:

• Use R as a tool to perform basic data analytics and reporting, and apply basic data visualization techniques to your data.

• Apply basic analytics methods such as distributions, statistical tests and summary operations, and differentiate between results that are statistically sound vs. statistically significant.

• Identify a model for your data and define the null and alternative hypotheses.

Putting the Data Analytics Lifecycle into Practice

• From Lesson 1 you learned a strategy to approach any data analytics problem:

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning (covered in Module 4)

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

• To begin to analyze the data you need:

  1. A tool that allows you to look at the data – that is “R”.

  2. Skill in basic statistics – we’re providing a refresher.

Lesson 1 presented the data analytics lifecycle. The first three phases represent our initial exploration of our data and the results of that exploration.

In order to begin to analyze the data, you need a way to “look” at the data and a tool to work with and present the data. What does “look” mean here? It means both computing basic statistical measures and creating graphs and plots of the data in order to visualize relationships and patterns. Our tool of choice for this activity is the statistical package, R.

During this lesson the following topics are covered:

• Using the R Graphical User Interface

• Overview: Getting Data into (and out of) R

• Data Types Used in R

• Basic R Operations

• Basic Statistics

• Generic Functions

Using R to Look at Data – Introduction to R

GETTING A HANDLE ON THE DATA

This lesson covers the topics listed above. The techniques you learn here will allow you to handle your data and get to know it: that is, acquire, parse, and filter your data.

We’ll be using R to process the data as well as to create basic summary statistics and datasets for analysis.

These processes will allow you to understand what you have, and apply these techniques to any data analytics project.

Five Things to Remember About R

1. (Almost) everything is an object

2. (Almost) everything is a vector

   Example: x <- 3                # x is a vector of length 1

            v <- c(2,4,6,8,10)    # v is a vector of length 5

3. All commands are functions

   Example: quit() or q(), not q

4. Some commands produce different output depending…

5. Know your default arguments!

R is a big, complicated, messy, powerful, extensible framework for computing and graphing statistics. It is an open-source implementation of the S language, and its widespread availability and use have resulted in several vendors supplying R interfaces to their products.

There are five things that you should remember about R. Doing so will help you in thinking about how to work with R, and, more importantly, when R proves stubborn and insists that it doesn’t know what you’re talking about.

The first thing to remember is that, underneath it all, R is an object-oriented language, although in a functional style: operations are function calls that take objects as arguments. That means, for example, that the expression “x <- 3” actually invokes the assignment function. It is equivalent to `<-`(x, 3) or assign("x", 3), rather than to a method call such as x.assign(3).

Second, almost everything in R is expressed as a vector or a group of vectors. Although x looks like a scalar variable, it is actually a vector of length 1. Similarly, using the c() function to combine values into a vector, v is a vector of length 5. Each element of a vector can be addressed by a numerical index (e.g., v[3] returns 6, the 3rd element). In R, the index starts from 1, unlike languages such as C or Java, in which the index starts at 0.

Third, all commands in R are actually functions. Hence, you must type either quit() or q() to exit R. By itself, q is just the name of a function object: typing q without parentheses prints the definition of that function instead of calling it (str(q) shows only its argument list).
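The first three points can be checked directly at the console. The following is a minimal sketch, assuming a fresh R session; the variable y is hypothetical and is used only to show that assignment is itself a function call.

```r
# 1-element "scalars" are really vectors
x <- 3
length(x)            # 1

# c() combines values into a longer vector
v <- c(2, 4, 6, 8, 10)
length(v)            # 5
v[3]                 # 6 (indexing starts at 1)

# Assignment is a function call: this has the same effect as y <- 3
`<-`(y, 3)
identical(x, y)      # TRUE

# q is a function object; q() would call it, q alone would print it
is.function(q)       # TRUE
```

Note that the sketch never calls q(), since doing so would end the session.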

Using the RStudio Graphical User Interface

[Figure: the RStudio window, with its Script, Workspace, Plot, and Console panes]

R comes in multiple flavors. The heart of the software is a command-line interface (CLI) that is very similar to the Bash shell in Linux or the interactive versions of scripting languages like Ruby or Python.

The Windows version of R supports multiple GUIs. The default GUI is started by invoking the R program either via the command line or via the Windows GUI. Within R, the Rcmdr interface offers a more task-oriented view. The Rattle interface is another task-oriented framework: a user can load a dataset and automatically perform certain tests. Finally, RStudio provides both a desktop and a Web browser interface. This is the UI that we will be using in this course.

RStudio offers four panes, three of which are fairly common to all R GUIs. The upper left pane is for script editing. The lower left pane is the R console itself, where all commands are executed.

The lower right pane holds the help screen, invoked by the help(<topic>) command, with which you will become very familiar. It also has tabs for the files in the current directory, for plots, and for viewing which packages are available locally or can be downloaded from CRAN, the Comprehensive R Archive Network. Finally, the upper right pane is unique to RStudio, and offers a table-oriented view of the variables stored in the current R workspace. Clicking on a variable or data structure in the workspace window will display the values of that object in the script window as a separate tab.

Note that many panes have multiple tabs that offer different views on the workspace: take a moment to familiarize yourself with their content during our first lab. Each pane can be grown or shrunk by clicking on the grow boxes in the upper right hand corner of each pane.

Where to get R and RStudio

• First, download and install R

  ▪ For Windows, go to http://cran.r-project.org/bin/windows/base/

  ▪ For Linux, go to http://cran.us.r-project.org/bin/linux/

  ▪ For Mac OS X, go to http://cran.us.r-project.org/bin/macosx/

• Then, download and install RStudio

  ▪ http://www.rstudio.com/ide/download/

Overview: Getting Data Into (and Out of) R

• Getting Data Into R

  ▪ Type it in (if it’s small)!

  ▪ Read from a data file

  ▪ Read from a database

• Getting Data Out of R

  ▪ Save in a workspace

  ▪ Write a text file

  ▪ Save an object to the file system

  ▪ You can save plots as well!

How do we input data into R? The first method, and sometimes the simplest, is: type the data in! This is a good method for small data sets.

You can always read raw data from a data file using read.table(). There are several helper functions for reading delimited data as well as fixed-length fields; the scan() function permits reading fields of variable length.

You can also read data in from a database. Both DBI-based drivers (including JDBC, for Java) and ODBC (Microsoft) interfaces are supported. Drivers are available for most popular databases.

Once the data is in, there are several ways to save it from an R workspace. You can save the entire workspace and restore it in a later session. You can also write an R data object (usually a data frame) as a text file with field delimiters. Finally, you can save an R object or objects as a binary file, which can be loaded back into another session.

R also allows you to specify a particular output device, which is the standard way to save the results of a graph or a plot. RStudio allows you to save the graph as an image (.jpg, etc.) directly from the plot window.

Typing Data Into R

Data can always be created by typing in values. For example, the vector assignment v <- c(1:10) creates a vector of 10 elements with the values 1 through 10.

More complicated data structures can be created by composing them from a group of other data structures. First create an empty data structure, then fill it in via the editor or by cutting and pasting from external files. The R script editor allows tweaking of input, and is easier than editing keystrokes in the console window. Remember that we can transform from one object type to another, so we could read data in as a matrix and use the as.data.frame() function to create the data frame.

The RStudio interface allows you to create an empty data object, and then edit that object via the edit() or fix() functions. The graphic shows the edit screen when an edit() or a fix() function is invoked for a particular variable. When creating a data frame you can create and name your variables as well (for example, LastName (character), etc.).
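As a minimal sketch of typing data in, the block below builds a vector, a small data frame, and a data frame converted from a matrix. The object names (people, mdf) and all values are hypothetical, chosen only for illustration.

```r
# A vector of the values 1 through 10
v <- c(1:10)

# A small data frame typed in column by column
people <- data.frame(
  LastName = c("Doe", "Smith", "Jones"),
  Age      = c(34, 51, 28),
  stringsAsFactors = FALSE      # keep LastName as character
)

# Type the data in as a matrix (filled column by column), then convert it
m   <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
mdf <- as.data.frame(m)
names(mdf) <- c("height", "width")

# Interactive editing opens an editor window, so it is not run here:
# people <- edit(people)
# fix(people)
```

The edit() and fix() calls are left commented out because they need an interactive session.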

Getting Data Into R: External Sources

• R supports multiple file formats

  ▪ read.table() is the main function

• File name can be a URL

  ▪ read.table("http://ahost/file.csv", sep=",") reads the same file as read.csv(…) (note that read.csv() also defaults to header=TRUE)

• Can read directly from a database via ODBC interface

  ▪ mydb <- odbcConnect("MyPostgresDB", …)

• R packages exist to read data from Hadoop or HDFS (more later)

Note! Use the forward-slash ("/") character in full file names, even on Windows:

"C:/users/janedoe/My Documents/Script.R"

R has the ability to read in data in many different formats. The read.table() function is the most used, although there are multiple helper functions such as read.csv() and read.delim() for delimited files, and read.fwf() for reading fixed-length fields. Multiple import functions also exist, including functions for reading in data from SPSS, SAS, Systat, and other statistical packages. The file name argument to read.table() can also be a URL: this is useful in reading a data file from the Internet. Consult the help subsystem (help(read.table)) for more options.

R uses a forward slash "/" as the separator character in full pathnames for files on every platform. A file in your documents directory in Windows would be written as "C:/users/janedoe/My Documents/Newscript.R". (A backslash inside an R string must be doubled, as in "C:\\users", because the backslash is the escape character.) This makes script files somewhat more portable at the expense of some initial confusion on the part of Windows users.
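The reading functions described above can be tried without any external file. This is a minimal sketch that writes two small temporary files first; the file contents, column names, and variable names are hypothetical.

```r
# Write a small comma-delimited file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("state,population", "Ohio,11500000", "Utah,2900000"), csv_path)

# read.table() needs the header and separator spelled out ...
a <- read.table(csv_path, header = TRUE, sep = ",")

# ... while read.csv() supplies those defaults
b <- read.csv(csv_path)

# Fixed-width fields: 2 characters of state code, then 3 digits
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("OH115", "UT029"), fwf_path)
f <- read.fwf(fwf_path, widths = c(2, 3), col.names = c("state", "pop"))

# Forward slashes work in path strings (this path is not opened)
script_path <- "C:/users/janedoe/My Documents/Script.R"
basename(script_path)
```

The same calls accept a URL in place of csv_path, provided the machine has network access.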

Getting Data Out of R

Options                               R Code

Save it as part of your workspace     save.image(file="dfm.Rdata")
(or a different workspace)            save.image()    # writes .RData in the working directory
                                      load("dfm.Rdata")

Save it as a data file                write.csv(dfm, file="dfm.csv")

Save it as an R object                save(Mydata, file="Mydata.Rdata")
                                      load(file="Mydata.Rdata")

Plots can be saved as images          savePlot(filename="filename.ext", type="type")

R utilizes a workspace that consists of a collection of data objects, code libraries, and named data sets. Each workspace also supports multiple environments, although we won’t address this issue further (see the R reference manual for more details).

R libraries that are not automatically loaded can be loaded into the workspace via the library(<package>) command; saved datasets can be loaded into R via the load("<file>") command.

Packages that are not part of the standard distribution can be obtained via the install.packages("<packagename>") command (note the use of double quotes).

Data objects in R can be exported either as a .csv file or in native format via save(<object name> …, file="<full file path name>"), usually with an .Rdata extension, and then reloaded into the R workspace via load(file="<full path name>"). This repopulates the workspace with that object or objects.

If you choose to save your R workspace, it can be reloaded automatically when R is restarted. Other saved workspaces can be loaded into R with the load() command.

Lastly, a plot shown on screen can be saved to a file using the savePlot() command, which is available for some screen devices (for example, on Windows). On any platform you can instead open a file device such as png() or pdf() before plotting and close it with dev.off(). Most platforms will allow .jpg and .png, but check your local R documentation for your particular platform.
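The export options can be exercised end to end with a small round trip. This is a sketch under stated assumptions: dfm is a hypothetical three-row data frame, all files go to temporary paths, and the plot is written through the pdf() file device because it does not depend on a screen device being available.

```r
dfm <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Export as a delimited text file and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(dfm, file = csv_path, row.names = FALSE)
dfm_csv <- read.csv(csv_path)

# Export as a native R object, remove it, and restore it
rdata_path <- tempfile(fileext = ".Rdata")
save(dfm, file = rdata_path)
rm(dfm)
load(file = rdata_path)       # dfm reappears under its original name

# Save a plot by opening a file device, plotting, and closing the device
plot_path <- tempfile(fileext = ".pdf")
pdf(file = plot_path)
plot(dfm$id, dfm$score)
invisible(dev.off())
```

Note that load() restores objects under the names they were saved with, so it can silently overwrite an existing object of the same name.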

Data Classification: A Quick Review

Data “Noir”     Examples

Nominal         condo, house, rental

Ordinal         hates < dislikes < neutral < likes < loves

Interval        10F colder tomorrow than today

Ratio           5342 > 4321

Some statistical tests require data at the interval level or higher. Other tests assume ordinal or nominal. Make sure you check.

Recall our general classification of the measurement of data. Data can be measured at the nominal, ordinal, interval, or ratio level. Nominal data is simply a label; there is no order implied. Ordinal data, on the other hand, does have an implied order. For example, I may assign the values of “Good”, “Better” and “Best” such that best > better, and better > good. However, I don’t know the measure of the distance between each value.

Interval data has a fixed distance between each element, but an arbitrary 0 point (temperature is an example of this level of measurement). We can say how far apart two elements are, but we can’t say that 30 degrees is half as warm as 60 degrees. Ratio data, on the other hand, does have a meaningful zero point (e.g. dollars spent on clothing: $0), and $20 is twice as much money as $10.

Some data can be converted to another form. By numerically coding an ordinal level of measurement, we can generate a single measure of whether someone approves of something or not. A mean value of 3.5 for a Likert scale (coded 1:5) could be interpreted to imply that generally people were positive about an event, but we don’t know if everyone was mostly neutral or varied between love and hate. In this case, viewing a table of responses would be preferred. We can, however, recode binary values: for example, we can code “female” as 1 and “male” as 0 and determine gender balance.

When choosing a statistical measurement, ensure that you have chosen one that is compatible with your data. R will often calculate a number anyway, but the result is misleading at best.
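The measurement levels map naturally onto R factors. The following is a minimal sketch with made-up survey responses; the variable names and data are hypothetical. It shows why a table of an ordinal variable can say more than the mean of its numeric codes.

```r
# Nominal: labels with no order
housing <- factor(c("condo", "house", "rental", "house"))

# Ordinal: an ordered factor on a 5-point scale
scale_labels <- c("hates", "dislikes", "neutral", "likes", "loves")
responses <- factor(c("loves", "hates", "loves", "hates", "neutral", "loves"),
                    levels = scale_labels, ordered = TRUE)
responses[1] > responses[2]      # TRUE: "loves" ranks above "hates"

# Coding the scale 1:5 gives a mean that looks mildly positive ...
codes <- as.integer(responses)
mean(codes)

# ... but the table shows the responses are split between the extremes
table(responses)

# Binary recoding: the mean of a 0/1 variable is a proportion
gender    <- c("female", "male", "female", "female")
is_female <- as.integer(gender == "female")
mean(is_female)                  # 0.75
```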

Data Types Used in R

Data Types           R Code

Numbers, Strings     n <- 3
                     s <- "columbus, ohio"

Vectors              levels <- c("Wow", "Good", "Bad")
                     ratings <- c("Bad", "Bad", "Wow")

Factors and Lists    f <- factor(ratings, levels)
                     l <- list(ratings=ratings, critics=c("Siskel","Ebert"))

Functions            stdev <- function(x) {sd(x)}

The workhorse data types of R are the vector and the data frame. Recall that (almost) everything in R is an object and a vector. Numbers and strings are 1-element vectors (that is, length(n) == length(s) is true). Vectors can be numeric (c(1,2,3)) or character (c("Wow", "Good", "Bad")). If you mix types (c(1, "two", 3)), R coerces every element to character.

Factors are categorical variables. If the available data doesn’t include a particular label, it can be supplied in the 2nd argument to the factor() command, which lists the allowed levels. Lists are ordered collections of (usually named) components, which may be vectors or any other R objects. In the example above, we have defined two character vectors, levels and ratings. We create a factor, f, using ratings as our values and levels as the allowed levels, and then create a list structure using our ratings vector and a new vector for critics.

You can write your own functions in R. You can alias an existing R function as demonstrated in the example above, where stdev(x) simply calls the R function sd() to compute the standard deviation of a vector, or your function can be arbitrarily complex. See help("function") in the online help for more details.
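The slide's examples can be run as written once the quotes are plain ASCII. The sketch below repeats them and adds a few checks; the variable mixed is a hypothetical addition used to show type coercion.

```r
n <- 3
s <- "columbus, ohio"
length(n) == length(s)          # TRUE: both are 1-element vectors

# Mixing numbers and strings coerces everything to character
mixed <- c(1, "two", 3)
class(mixed)                    # "character"

levels  <- c("Wow", "Good", "Bad")
ratings <- c("Bad", "Bad", "Wow")

# "Good" never occurs in ratings, but it is still an allowed level
f <- factor(ratings, levels)
table(f)

# A list can hold components of different lengths
l <- list(ratings = ratings, critics = c("Siskel", "Ebert"))
l$critics[2]                    # "Ebert"

# A user-defined function that wraps sd()
stdev <- function(x) { sd(x) }
stdev(c(2, 4, 6))               # 2
```

Naming a variable levels works here, but it shares its name with the built-in levels() function, so a different name may be clearer in longer scripts.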

R Structured Types

Data Types                             R Code

Matrix (n*m array, usually numeric)    m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)

Table (contingency table)              t <- table(dfm$factor_variable)

Data frames (data sets)                dfm <- read.csv("CrimeRatesByStates2005.csv")

Extracting data                        xdfm <- dfm[1:3,]
                                       ydfm <- dfm[, 3:5]
                                       s <- dfm$state

R structured types are the matrix, the table, and the data frame.

The matrix is what you think it is: an N by M array usually consisting of numeric values.

Tables are our old friend contingency tables, especially useful for observing nominal or ordinal data.

Finally, data frames are the real workhorse of R. These structures reflect most directly a dataset view of the world, where each row (record) contains several data fields. Usually rows are ordered by number (1..n) as opposed to tables, where rows are named entities (“High”, “Medium”, “Low”).

There are several ways to extract data from a structured type. You can select a subset of rows (dfm[1:10,]) or a subset of columns (dfm[,3:4]). You can assign a column to a vector, and that vector will take on the resulting type (numeric, character, etc.). These “slices” can be transformed into other types by using the as.<type> function (e.g. dfm <- as.data.frame(t)).

Why does this matter? There are two reasons:

1. Knowing the class of an R variable (via class(v)) helps us understand where and when it can be used in a function, or whether it needs to be converted into a different representation (foo <- as.data.frame(t…)).

2. Knowing the type of the underlying data helps us understand when data conversion is needed. Sometimes what appears to be numeric data is encoded as character strings (the string "12345" is not the number 12345). Hence, in order to perform certain calculations, we may need to convert data (as.numeric(t$age)).
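A short sketch of the structured types and of both reasons above. Because the CrimeRatesByStates2005.csv file is not reproduced here, a small hypothetical data frame stands in for it; its column names and values are invented for illustration.

```r
# Matrix: filled row by row because byrow = TRUE
m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
dim(m)                          # 2 3
m[2, 1]                         # 11

# A stand-in data frame (hypothetical values)
dfm <- data.frame(
  state = c("A", "B", "C", "D"),
  size  = c("High", "Low", "Low", "Medium"),
  age   = c("31", "45", "28", "39"),   # numeric-looking data stored as text
  stringsAsFactors = FALSE
)

# Extracting rows, columns, and a single column
xdfm <- dfm[1:3, ]
ydfm <- dfm[, 2:3]
s    <- dfm$state

# Contingency table, then conversion to a data frame
t   <- table(dfm$size)
tdf <- as.data.frame(t)         # columns Var1 and Freq

# Reason 1: check the class before using an object
class(t)                        # "table"

# Reason 2: convert character data before calculating
class(dfm$age)                  # "character"
mean(as.numeric(dfm$age))       # 35.75
```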

Basic R Operations on Vectors

Function                     R Code

Operations on Vectors        v <- c(1:10); w <- c(15:24); nv <- v * pi; nw <- w * v

Vector transformations       radius <- sqrt(d$area / pi)
                             t <- table(dfm$factor_variable)
                             pct <- t/sum(t) * 100

Logical Vectors              v[v < 1000]
                             ndf <- subset(dfm, population < 10000)
                             nv <- v[c(1,2,3,5,8,13)]

Examining data structures    dim(dfm); attributes(dfm); class(dfm); typeof(dfm)

Recall that a vector is a 1-dimensional array with a single data type (such as character or numeric). We can perform several different transforms on a vector: multiplying each value by a scalar, creating a new vector by multiplying one vector by another element by element, etc. We also can transform the contents of a vector by performing a transform on each element. If I have a vector of circle areas called d$area, I can create a new vector of radii as radius <- sqrt(d$area / pi).

An example of this kind of manipulation is illustrated by creating a table using a factor from a larger dataset. This results in a table where each level of the factor has a count of the number of times it appears in that dataset. We can then create another vector containing percentages using the statement pct <- t/sum(t)*100, and add it as a second row of the table via t <- rbind(t, pct).

Logical vectors are created whenever a comparison expression is used as an index. In the case above, the expression v < 1000 creates a logical vector whose value is TRUE wherever the corresponding element of v is less than 1000. Each element of v that is marked TRUE is then included in the result. This is useful for creating subsets of larger data sets, as we shall see later on in this module. The subset() function provides another way to create a subset of values. A specific set of indexes can be used as well: here we create a new vector from the elements of v at positions 1, 2, 3, 5, 8, and 13, which are six Fibonacci numbers. (Because v has only 10 elements, position 13 returns NA.)
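These vector operations can be sketched as follows. The data frame d, the factor sizes, and their values are hypothetical stand-ins; the threshold in the logical index is lowered to 5 so that the filter visibly removes elements.

```r
v <- c(1:10); w <- c(15:24)
nv <- v * pi                    # every element multiplied by a scalar
nw <- w * v                     # element-by-element product

# Radius of a circle from its area: r = sqrt(area / pi)
d <- data.frame(area = c(pi, 4 * pi, 9 * pi),
                population = c(500, 20000, 8000))
radius <- sqrt(d$area / pi)     # 1 2 3

# Counts per factor level, percentages, and both bound as rows
sizes <- factor(c("Low", "High", "Low", "Medium"))
t   <- table(sizes)
pct <- t / sum(t) * 100
tp  <- rbind(t, pct)

# Logical indexing keeps the elements where the test is TRUE
v[v < 5]                        # 1 2 3 4

# subset() does the same for the rows of a data frame
ndf <- subset(d, population < 10000)

# Indexing by position; position 13 does not exist in v
fib <- v[c(1, 2, 3, 5, 8, 13)]  # 1 2 3 5 8 NA
```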

Descriptive Statistics

Function                      R Code

View the data                 head(x); tail(x)

View a summary of the data    summary(x)

Compute basic statistics      sd(x); var(x); range(x); IQR(x)

Correlation                   cor(x); cor(d$var1, d$var2)

One of the first things to do when receiving a dataset is to validate your assumptions. Is the data clean? Does it make sense? I personally use head(ds) and tail(ds) to look at the first and last few values.

The next command is summary(), which provides the minimum, maximum, median, mean, and the 1st and 3rd quartile values. (Compare this against the values returned from the fivenum(ds) function.)

Other functions include sd() (standard deviation), var() (variance), range() (lowest and highest values), and IQR(), which returns the interquartile range (the difference between the 3rd and 1st quartiles). The cor() function computes the correlations between the variables in a dataset or, when two arguments are supplied, between the vectors provided as the values of x and y.
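A minimal sketch of these functions, assuming the mtcars data set that ships with R stands in for your own data; x and cm are hypothetical variable names.

```r
x <- mtcars$mpg

# First and last six rows of the data set
head(mtcars)
tail(mtcars)

# Minimum, quartiles, median, mean, maximum
summary(x)

# Tukey's five-number summary; its quartiles may differ slightly
# from summary() because they are computed by a different rule
fivenum(x)

# Spread
sd(x); var(x); range(x); IQR(x)

# Correlation between two vectors, and a correlation matrix
cor(mtcars$mpg, mtcars$wt)      # negative: heavier cars get fewer mpg
cm <- cor(mtcars[, c("mpg", "wt", "hp")])
cm
```

Passing a data frame of numeric columns to cor() returns a square matrix with 1s on the diagonal.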

Generic Functions

• Also known as method overriding in OO-land

• Specific actions that differ based on the class of the object :

• Good for initial data exploration (more later)

Function                   R Code

Plot the variable x        plot(x)

Histogram of x             hist(x)

Internal structure of x    str(x)

R makes use of a number of generic functions (we’ll call them that because they explicitly take an object as their 1st argument, instead of the more OO notation of object.print()). R chooses which version of the function to run based on the class of that argument. In a strict OO language, these would be called virtual functions or methods and overridden by each class that wanted to make this capability available (consider the toString() method in Java). Such functions can have multiple parameters that affect their behavior. Review the help(plot) documentation as an example.
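Dispatch on the class of the first argument can be seen with summary(), another generic function. This is a minimal sketch; the "rating" class and its summary.rating() method are hypothetical, defined here only to show how a new class supplies its own behavior.

```r
num <- c(2, 4, 6, 8)
fac <- factor(c("a", "b", "a"))

# Same function name, different behavior per class
summary(num)                    # quartiles and mean for a numeric vector
summary(fac)                    # counts per level for a factor

# A hypothetical class with its own summary method
r <- structure(list(critic = "Ebert", stars = 4), class = "rating")

summary.rating <- function(object, ...) {
  paste(object$critic, "gave", object$stars, "stars")
}

summary(r)                      # "Ebert gave 4 stars"
```

plot(), hist(), and str() from the slide follow the same pattern, choosing a method according to the class of x.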

Check Your Knowledge

• Which data structures in R are the most used? Why?

• Consider the cbind() function and the rbind() function that bind a vector to a data frame as a new column or a new row. When might these functions be useful?

Please take a moment to answer these questions.

During this lesson the following topics were covered:

• How to use the R Graphical User Interface

• How to get data into (and out of) R

• Data Types used in R, and the basic R operations

• Basic descriptive statistics

• Using generic functions

Summary

This slide contains the key points covered in this lesson. Please take a moment to review them.
