Lab5

Please use the data in Flights.jmp.

Predicting flight delays can be useful to a variety of organizations: airport authorities, airlines, aviation authorities. At times, joint task forces have been formed to address the problem. Such an organization, if it were to provide ongoing real-time assistance with flight delays, would benefit from some advance notice about flights likely to be delayed.

Airlines can also use this information to retain customers. A customer who is alerted to a flight delay with enough warning can change their flight rather than simply cancel it. Flight delays still cost airlines money, but advance notice lets them salvage more customer relationships. If airlines are made aware of delays earlier in the process, they can avoid last-minute pileups and keep their customers happier.

The goal is to predict accurately whether a new flight, not in this dataset, will be delayed. The outcome variable is FLIGHT_STATUS, which records each flight as delayed or on time. Please use the data in Flights.jmp.

Variable/Column name – Description

· FL_DATE – Flight date

· FL_NUM – Flight number

· WEEKDAY – BusinessDay or Weekend

· WEATHER – 1 if there was a weather-related delay, 0 if not

· CRS_DEP_TIME – Scheduled departure time

· SCHEDULED_DEPARTURE – Scheduled departure time period: morning, noon, after2pm, or evening

· ACTUAL_DEP_TIME – Actual departure time

· ORIGIN – Origin

· DEST – Destination

· CARRIER – Carrier

· FLIGHT_STATUS – Flight status: 1 = on time, 0 = delayed

The first 1500 rows have been designated as Training Data. The remaining 701 rows are Testing Data.
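
Outside JMP, the same row-based partition can be sketched in a few lines of Python. This is only an illustration: `rows` is a hypothetical stand-in for the table in file order, and only the 1500/701 counts come from the description above.

```python
# Hypothetical stand-in for the table: one entry per row, in file order.
rows = list(range(2201))  # 1500 + 701 rows, per the description above

TRAIN_SIZE = 1500

train_rows = rows[:TRAIN_SIZE]  # first 1500 rows -> training data
test_rows = rows[TRAIN_SIZE:]   # remaining rows  -> testing data

print(len(train_rows), len(test_rows))
```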

Build a Decision Tree Model for Questions 1-4 to predict FLIGHT_STATUS using the following 5 variables:

WEEKDAY

SCHEDULED_DEPARTURE

ORIGIN

DEST

CARRIER

1. From the Decision Tree, which destination has the highest probability of being on time?

2. From the Decision Tree, describe the characteristics of the flights that are most likely to be delayed.

3. JMP stopped splitting the decision tree at a certain point. What would happen to the RSquare of the training data if we split the tree one additional time?

4. In looking at the model, how many flights within the Decision Tree fit the following characteristics: CARRIER (UA, US, DL), DEST (JFK), SCHEDULED_DEPARTURE (After2pm, Evening)?

Build a Linear Classification Model using Stepwise Regression for Questions 5-9, with a threshold of 0.78, to predict FLIGHT_STATUS using the following 5 variables:

WEEKDAY

SCHEDULED_DEPARTURE

ORIGIN

DEST

CARRIER

5. What is the P-Value of the model?

6. What is the accuracy of the model?

7. If we wanted to increase the accuracy of the model, what could we do?

8. Assume the following:

· Every time a flight is correctly predicted to be delayed, there is a gain of $100.

· Every time a flight is predicted to be delayed, but is on time, there is a loss of $200.

· Every time a flight is predicted to be on time, but is delayed, there is a loss of $1000.

· Every time a flight is correctly predicted to be on time, there is no gain.

Based on the assumptions above, how much of a gain or loss would this model give us? Assume only the testing data is used to make this calculation.
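
The accuracy and gain/loss questions above both come down to counting the four prediction outcomes on the testing data. The following Python sketch shows the arithmetic under stated assumptions: the threshold of 0.78 is assumed to apply to the predicted probability of being on time (the lab does not say which class it applies to), and the `actual` and `probs` values are made-up toy numbers, not results from Flights.jmp. The dollar amounts are the ones listed above.

```python
# Payoffs taken from the assumptions listed in the lab.
PAYOFF = {
    "delayed_correct": 100,    # predicted delayed, was delayed
    "false_delay": -200,       # predicted delayed, was on time
    "missed_delay": -1000,     # predicted on time, was delayed
    "on_time_correct": 0,      # predicted on time, was on time
}

def classify(prob_on_time, threshold=0.78):
    """Assumption: predict 1 (on time) when P(on time) >= threshold, else 0 (delayed)."""
    return 1 if prob_on_time >= threshold else 0

def confusion_counts(actual, predicted):
    """Count the four outcomes; FLIGHT_STATUS coding is 1 = on time, 0 = delayed."""
    counts = {key: 0 for key in PAYOFF}
    for a, p in zip(actual, predicted):
        if p == 0 and a == 0:
            counts["delayed_correct"] += 1
        elif p == 0 and a == 1:
            counts["false_delay"] += 1
        elif p == 1 and a == 0:
            counts["missed_delay"] += 1
        else:
            counts["on_time_correct"] += 1
    return counts

def accuracy(counts):
    """Share of flights whose status was predicted correctly."""
    correct = counts["delayed_correct"] + counts["on_time_correct"]
    return correct / sum(counts.values())

def net_payoff(counts):
    """Total gain (positive) or loss (negative) in dollars."""
    return sum(PAYOFF[key] * n for key, n in counts.items())

# Toy data for illustration only (hypothetical, not from the lab dataset).
actual = [1, 1, 0, 0, 1, 0]
probs = [0.95, 0.70, 0.40, 0.85, 0.80, 0.60]

predicted = [classify(p) for p in probs]
counts = confusion_counts(actual, predicted)

print(counts)
print(round(accuracy(counts), 3))
print(net_payoff(counts))
```

With real output, the four counts would come from the model's confusion matrix on the 701 testing rows rather than from a toy list.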

Please use data in salaries.xlsx and satisfaction.txt

The city of San Diego compensates its employees with a combination of base payment, overtime payment, bonus payment, and benefits. Each year the city evaluates a subset of employees, recording their compensation information, results of a performance evaluation, and results of a satisfaction survey. The city would like to use these data to understand if and how certain compensation structures influence employee performance and employee satisfaction.

The file “salaries.xlsx” contains salary and performance evaluation data on each employee.

· ID – Employee identification number

· Year – Year the employee was evaluated

· Employee name – Name of the employee

· JobTitle – Employee’s Position/Job Title

· BasePay – ($) Amount the employee earned in base pay in the year they were evaluated

· OvertimePay – ($) Amount the employee earned in overtime in the year they were evaluated

· BonusPay – ($) Amount the employee earned in bonus payments in the year they were evaluated

· Benefits – (binary) Whether the employee received health benefits in the year they were evaluated

· Performance – (1-5 stars) Performance score of the employee for the year they were evaluated

The file “satisfaction.txt” contains data on each employee’s satisfaction.

· ID – Employee identification number

· Satisfaction Level – (1-5 stars) Job satisfaction as reported by each employee in the survey for the year they were evaluated
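
Because both files share the ID column, satisfaction scores can be attached to the salary records by joining on ID. The sketch below shows one way to do that in plain Python. The records are hypothetical, and it assumes each ID appears at most once in the satisfaction file, which should be checked against the real data.

```python
# Hypothetical records; field names follow the column descriptions above.
salaries = [
    {"ID": 1, "Year": 2011, "Performance": 3},
    {"ID": 2, "Year": 2014, "Performance": 5},
    {"ID": 3, "Year": 2014, "Performance": 2},
]
satisfaction = [
    {"ID": 1, "Satisfaction Level": 4},
    {"ID": 2, "Satisfaction Level": 5},
]

# Index satisfaction by ID (assumes each ID appears at most once in that file).
satisfaction_by_id = {row["ID"]: row["Satisfaction Level"] for row in satisfaction}

# Inner join: keep only salary records that have a matching survey response.
joined = [
    {**row, "Satisfaction Level": satisfaction_by_id[row["ID"]]}
    for row in salaries
    if row["ID"] in satisfaction_by_id
]

for record in joined:
    print(record)
```

An inner join silently drops employees with no survey response (ID 3 here), so it is worth comparing row counts before and after the join.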

a. How many unique job titles are there in the dataset?

b. How much money did the city pay in bonuses to employees who only had a 1 or 2 performance rating? Provide the total sum of bonus pay.

c. For each performance rating (1-5 stars), compute the average of employee satisfaction for employees with that performance rating. Is there a relationship here? Most managers argue that the more satisfied someone is with their job, the higher they perform. Does this data support that conclusion? Explain.

d. Starting in 2012, a number of initiatives were introduced to increase job performance amongst employees. As a result, a manager believes that, overall, employees are more satisfied with their jobs in 2014 than in 2011. Is there evidence to support this claim? Use A/B testing to justify your response quantitatively (at a 5% significance level).

e. Repeat the question above, but do so separately for employees who received benefits vs. those who did not. Are benefits a confounding variable (at a 5% significance level)?

f. A consulting team is hired to develop a metric for evaluating ‘potential for a promotion’ amongst employees. They suggest evaluating every employee annually, and calculating the following ratio: ((Job Performance * 0.6) + (Job Satisfaction * 0.2)). Employees with growth of 1% on average over a 5-year period are fast-tracked for promotion. Those who grow at a rate of 0.99% or lower are re-evaluated and potentially assigned to new roles. If more than half the employees do not grow at a rate of at least 1%, the city is planning on investing in new training programs. Based on your understanding of the fundamentals of a good metric, critique the above approach.
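
Parts c and d above reduce to two computations: an average per group, and a comparison of two group means. The Python sketch below illustrates both on hypothetical records. For the comparison it uses Welch's t-statistic with a normal approximation to the one-sided p-value, which is only one possible way to run an A/B test and is reasonable only for large groups. The field name `Satisfaction` and all values are made up for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, variance

def mean_by_group(records, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: mean(values) for group, values in sorted(groups.items())}

def welch_one_sided(a, b):
    """Test H1: mean(b) > mean(a).

    Returns Welch's t-statistic and a normal-approximation p-value
    (an assumption that is only reasonable for large samples).
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(b) - mean(a)) / standard_error
    p_value = 1 - NormalDist().cdf(t)
    return t, p_value

# Hypothetical joined records; far too few for a real test.
records = [
    {"Year": 2011, "Performance": 2, "Satisfaction": 2},
    {"Year": 2011, "Performance": 3, "Satisfaction": 3},
    {"Year": 2011, "Performance": 3, "Satisfaction": 2},
    {"Year": 2011, "Performance": 4, "Satisfaction": 3},
    {"Year": 2014, "Performance": 3, "Satisfaction": 4},
    {"Year": 2014, "Performance": 4, "Satisfaction": 5},
    {"Year": 2014, "Performance": 5, "Satisfaction": 4},
    {"Year": 2014, "Performance": 5, "Satisfaction": 5},
]

# Part c: average satisfaction for each performance rating.
avg_satisfaction = mean_by_group(records, "Performance", "Satisfaction")
print(avg_satisfaction)

# Part d: is satisfaction higher in 2014 than in 2011?
sat_2011 = [r["Satisfaction"] for r in records if r["Year"] == 2011]
sat_2014 = [r["Satisfaction"] for r in records if r["Year"] == 2014]
t_stat, p_value = welch_one_sided(sat_2011, sat_2014)

ALPHA = 0.05  # 5% significance level, as stated in the question
print(round(t_stat, 3), p_value < ALPHA)
```

For part e, the same comparison could be repeated on the subsets of records with and without benefits, and the two results compared with the overall one.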