correlation

profilemaqi0912
week4_slides_061421.pptx

Simple Linear Regression

Chapter 9

Key Points of the Chapter:

Databases contain lots of information, and we’ve covered some basic ways to analyze data and to draw conclusions or insights – e.g. univariate analyses (1 variable), cross-tabs (2, or pairs, of variables). But what do you do if you want to do more sophisticated analyses?

There are methods to use pieces of info as predictors to predict an outcome of interest (response, purchase, cancel, etc.)

Regression is an analytic approach that captures (models) that relationship between variable(s) that are predictive of an outcome of interest

The fundamentals of simple linear regression are the topic for this week, but keep in mind that the fundamentals extend to multiple regression

Chapter 9: Simple Linear Regression

‹#›

‹#›

Is there a relationship between customer income (x) and

customer spend (y)? Can you quantify this relationship?

Can this relationship be expressed as an equation or formula?

Why do we care about having an equation or formula?

Source: Exhibit 9.1-9.2, of text

Income (x) Spend (y)
58 76
42 45
24 26
76 102
33 42
69 97
31 33
46 49
51 52
38 40

Income ($k)

Chapter 9: Simple Linear Regression

Yes, as you did in hw3 by “r”

Yes. We care because it lets us go from “reporting” to “predicting”

‹#›

‹#›

Note to class – Think about the “boost” in analytic power you get that comes from being able to predict. . . .the future. It is much more powerful that “just” being able to report or describe present state. Going from correlation to regression allows us to make this transition.

3

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40

Yes, there is a relationship between income and spend

Yes, we can quantify (express, measure, etc.) this via “r”, the correlation

We can also model this as an equation (customer spend is the outcome of interest, customer income is a predictor of this outcome)

Having an equation allows us to understand the relationship and to make predictions

The Equation for this is: Y’ = -15.395 + 1.53(x) or Spend = (1.53*Income) – 15.4

Income (x) Spend (y)
58 76
42 45
24 26
76 102
33 42
69 97
31 33
46 49
51 52
38 40
  Coefficients
Intercept -15.395
X Variable 1 1.529809

Income ($k)

Spend ($)

Chapter 9: Simple Linear Regression

‹#›

‹#›

Note to Class: We’ll go through how we arrived at the particular equation shown here, and how to come up with the regression line.

4

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40

How do we get to a regression line that models the relationship?

Hint – Start with something that measures an association between 2 variables

Start with correlation, or “r”

Note that a linear regression equation is also an equation for a line . . . We can write equations in terms of z-scores:

y’ = a + bx Usual form of a straight line equation. Note the terms for intercept (a), and slope (b)

Zy’ = 0 + r Zx Re-written in terms of z-scores, note that intercept is 0 when using z-scores

Zy’ = r Zx Note what happens to Zy’ if r =1, (perfect correlation)

if r = 0, (no correlation)

if r = -1 (perfect negative correlation)

For Income (x) and Spend (y), note what happens when “r” is something other than -1,0, or 1.

What we want to know (show) is that the regression equation, model, line, is equivalent to and based upon our understanding and use of correlations as z-scores:Zy’ = r Zx

Ch 9: Regression: The intuition & the math

‹#›

‹#›

zx zy zxzy
0.701424769 0.79114 0.554925
-0.30061062 -0.44751 0.134527
-1.42790042 -1.20669 1.72303
1.828714576 1.83001 3.346565
-0.86425552 -0.56738 0.490364
1.390324096 1.630227 2.266544
-0.98950994 -0.92699 0.917268
-0.05010177 -0.28769 0.014414
0.263034288 -0.16782 -0.04414
-0.55111946 -0.6473 0.356737
  Coefficients
Intercept -3.2E-16
X Variable 1 0.976027

Zy’ = r Zx

0.976023 correlation

Source: Exhibit 9.1-9.2, data converted to z-scores

Ch 9: Regression: The intuition & the math

See slide note below re: something very cool about getting this equation

‹#›

‹#›

Note to Class: If we take our raw scores, and transform them into z-scores and calculate the correlation as you did in HW3, notice that you can generate the regression equation (via software or algebra), that is equivalent to Zy’ = r Zx. Or in “English”, the regression is that the “predicted Y value (as a zscore) = the correlation multiplied by the known X-value (as a zscore).”

6

0.70142476906215701 -0.300610615312353 -1.427900422733678 1.828714576483482 -0.86425551902301501 1.3903240958196339 -0.98950994206982901 -5.0101769218725403E-2 0.263034288398309 -0.551119461405981 0.791139556618222 -0.44751328455172201 -1.2066876065591059 1.8300096814704321 -0.56738291434236099 1.6302269651526999 -0.92699180371428103 -0.28768711149753501 -0.167817481706896 -0.64729600086945505

To find the regression line that models the relationship between X & Y (without software), here’s what’s happening:

i.e. We want to go from: Zy’ = r Zx to an equation of the form: Y’ = a + bX

(see the “white board” steps that algebraically take us from z-scores to the standard form of an equation for a line, and the “AHA moment”…..

See: file: correlation_to_regression_eq.pdf

Steps as follows:

Rewrite z-scores to raw scores for Y and for X…

Multiply (move) the sigma_y from the left side to the right side of the equation…

Add (move) “y_bar” (average_Y) from the left to the right, to isolate a predicted_y on the left . . .

Rewrite the right side of the equation, distributing terms and getting rid of parentheses . . .

Declare ‘b’ to be equal to part of the expression . . .

Rewrite using ‘b’ …

Declare ‘a’ to be equal to part of the expression . . .

Rewrite using ‘a’ and ‘b’ terms, . . . and we have our regression equation !

Notes:

‘b’ is the slope of the regression line adjusted by ratio of population standard deviations.

When dealing w/z-scores, the slope ‘b’ is just the correlation value, “r”

Ch 9: Regression: The intuition & the math

‹#›

‹#›

What is explained vs. unexplained variation in the context of regression?

If there is no relationship between “x” and “y”, what’s your prediction for “y”?

Answer: The Average, or y’ = $56.2

If there is a relationship, then knowing “x” allows you to predict “y”

From the equation, If income = $69, then predicted spend = $90.18

But we have data where income=$69, spend actually = $97. Not $90 ??

Note that this $7 difference is an unexplained (pun intended) variation

y’ = -15.395 + 1.53(x)

Y-Yavg

Y’-Yavg

Y-Y’

Ch 9: Explained and unexplained variation

‹#›

‹#›

Spend (y) 58 42 24 76 33 69 31 46 51 38 76 45 26 102 42 97 33 49 52 40

How “well” does our regression line (model) fit?

Although it’s a best-fit effort, the regression line will always have unexplained error (unless there’s a perfect correlation). Too “much” unexplained error means it’s not a “good” fit

R2 is the amount of variation in Y that is explained by the predictor X.

It’s a measure of how well the regression line models the data, or the “coefficient of determination”.

Mathematically it is literally the correlation, “R” squared

In the example of income and spend, R2 = 95.26%,

So, a bit more than 95% of the variation in Spend, can be accounted for by knowing Income

… and less than 5% is unexplained by the model. It’s a pretty good fit.

Ch 9: Regression: Assessing Fit

‹#›

‹#›

SQL variants and the latest in databases

State of Art

State of the Art in SQL

The basics (review):

Relational tables decompose information into parts and sub-tables. As you did with Student, Job, and Match info as a design exercise (HW2), and as a preliminary build (HW3). Very cool!

Any good design stores information efficiently, has relations based on keys, and has a structure reflected as a schema

Any RDBMS system uses SQL

‹#›

‹#›

State of the Art in SQL

Typical Commercial databases

Currently, most large scale industrial RDBMS are typically from Oracle, Teradata, Microsoft SQL Server.

These are typically “expensive” to license, but have extensive commercial support.

Newer Open Source databases

PostgreSQL, mySQL, SQLite some popular variations of SQL

As open-source, they are “free” but support is typically the responsibility of the user(s)

‹#›

‹#›

State of the Art in SQL

Newest Leading-Edge Database Systems

Advanced databases combine the power of a BigData paradigm with RDBM’s popularity and/or also take advantage of Open Source

Apache Hive is a BigData environment that uses HiveQL to “translate” Java and MapReduce() into SQL-like statements

Snowflake uses is a newer database technology for the cloud, and uses SnowSQL

‹#›

‹#›

State of the Art in SQL

Important differences among all of these SQL variants:

Fundamentally, SQL works for any and all relational databases (analogous to English as a language that is spoken in many countries, but with some localized variations)

Variations to be aware of typically have to do with:

How tables are joined,

How data are created/added,

How a selection of a few records from the “top” is handled

How a PIVOT table is supported or implemented

‹#›

‹#›

Extend correlation into regression, & build a regression model from data

(See Assignments for details)

HW

Exercise: Using the Email marketing (EMMA) dataset, build a simple regression model that predicts “Total customer value” (Y) from “Tenure” (X). . .

Use the Excel capabilities for Simple Regression (hint: data  scatterplot  fit trendline)

Use the algebra that we walked through to come up with the simple regression equation

You will need the following components to determine the slope ‘b’ and intercept ‘a’ values of the equation:

Correlation, r (you can use Excel’s CORR( ) function to calculate this),

Standard deviations for X and Y (use population standard deviations, stdev.p( ),

Means for X and Y,

What would you conclude about the relationship and ability to predict Value from Tenure?

Note: that answers to ‘1’ and ‘2’ above should all match if calculated correctly.

HW4: Regression Homework

‹#›

‹#›