Final project
ECON 178 WI21: Final Project Guidelines
Instructor: Ying Zhu
TAs: Davide Viviano, Connor Goldstick
©Ying Zhu 2020
Overview of the data
The data come from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.
The sample consists of households whose reference person is 25 to 64 years old. In each household, at least one person is employed and no one is self-employed. The unit of observation is the household reference person.
The data set contains a number of feature variables that you can use to predict total wealth. The outcome variable (total wealth) and the feature variables are described on the next slide.
Dataframe with the following variables
Variable to predict (outcome variable):
• tw: total wealth (in US $).
• Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity, plus the value of business, property, and motor vehicles.
Variables related to retirement (features):
• ira: individual retirement account (IRA) (in US $).
• e401: 1 if eligible for 401(k), 0 otherwise
Financial variables (features):
• nifa: non-401k financial assets (in US $).
• inc: income (in US $).
Variables related to home ownership (features):
• hmort: home mortgage (in US $).
• hval: home value (in US $).
• hequity: home value minus home mortgage.
Other covariates (features):
• educ: education (in years).
• male: 1 if male, 0 otherwise.
• twoearn: 1 if two earners in the household, 0 otherwise.
• nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.
• age: age.
• fsize: family size.
• marr: 1 if married, 0 otherwise.
What is 401k and IRA?
• Both 401(k) and IRA are tax-deferred savings options that aim to increase individual saving for retirement
• The 401(k) plan:
  • a company-sponsored retirement account to which employees can contribute
  • employers can match a certain % of an employee’s contribution
  • 401(k) plans are offered by employers, so only employees in companies offering such plans can participate
  • The feature variable e401 contains information on eligibility
• IRA accounts:
  • Individuals can participate
  • No employer matching
  • The feature variable ira contains the value of the IRA account (in US $)
Reference: https://www.investopedia.com/ask/answers/12/401k.asp
Your tasks
● Build a prediction/fitted model to predict total wealth (tw) in US dollars
● Write up a paper of up to 20 pages (not including the code), in 11-point font with 1.5 line spacing
  ○ Introduction
    ■ Briefly state the objectives of the study
  ○ Statistical analyses
    ■ Describe how you apply the tools you have learned from this course to perform the prediction task
    ■ You should try different methods and compare their prediction performance and interpretability
  ○ Conclusions
    ■ Summarize what you have discovered from this project
    ■ (Optional) Discuss caveats to the conclusions drawn from your analyses
● Bonus points
  ○ We kept 20% of the sample, on which we are going to run your proposed model and method. We will rank the students by the accuracy of their predictions on that 20% of the sample.
● The project is due on March 18 (by 12:30pm PST). Please submit your paper and code according to the instructions. Late assignments will NOT be accepted except with my prior consent regarding unusual circumstances permitted by University policies (proper documentation will be needed).
Grading policy
• First, please follow the policy on academic integrity stated in the syllabus:
• You are not allowed to work together with others on the final project and the bonus opportunity; you are not allowed to get any help (including but not limited to program code) from others on the final project and the bonus opportunity.
• We will use tools to catch any form of plagiarism and cheating. Penalties on cheating include, among others, a failing grade for the course. In addition, the Council of Deans of Student Affairs will impose a disciplinary penalty.
• Every student in ECON178 must read, understand, agree to, and sign the integrity pledge (https://academicintegrity.ucsd.edu/forms/form-pledge.html) before completing any assignment for ECON178. After you sign the pledge form, a receipt will be emailed to you. Please include this receipt in the submission of your assignment.
• Second, the maximum number of points (without the bonus points) you can get for the project is 40. Your project grade counts for 55% of your course grade. Slide 7 provides a breakdown of the points and how your project is graded.
• Third, a maximum of 40 bonus points is awarded on the basis of how good your out-of-sample prediction is. The best prediction receives 40 points. The second best prediction receives less than 40 points, and so on. The bonus points you earn count for 5% of your course grade.
• Fourth, the bonus points can only benefit your final grade. We will curve the grades without the bonus points first. Say if you are in the A bracket, you will stay in the A bracket even if you get zero bonus points. On the other hand, if you are in the A- bracket but you get enough bonus points to move your final grade to the A bracket, then you will get an A in the end.
• Fifth, it is entirely possible that you get the maximum points on the project but zero bonus points. After all, luck may be needed to get a high enough accuracy on the out-of-sample prediction. But as explained above, you will never be penalized for not having luck. Having said this, we still expect harder work is more likely to lead to higher bonus points. So, you should put in your best effort.
Grading
Analysis (50% of total points)
• 0-10 points: analysis is overly simplistic or inappropriate; little or no justification for choices of analyses is provided
• 10-30 points: analysis is appropriate; some justification for choices of analyses is provided
• 30-40 points: analysis is appropriate and informative; detailed justification for choices of analyses is provided
Results (25% of total points)
• 0-10 points: conclusions are missing, incorrect, or not drawn from analysis; plots or tables are inappropriate
• 10-30 points: conclusions are sensible and drawn from analysis; plots or tables are appropriate
• 30-40 points: conclusions are not only drawn from analysis but also insightful; plots or tables are nicely presented and help convey the information
Code (15% of total points)
• 0-10 points: code doesn't run, or the code’s outputs do not match the results described in the paper
• 10-30 points: code runs and the code’s outputs mostly match the results described in the paper
• 30-40 points: code runs and the code’s outputs match the results described in the paper; code is neat and easy to read; no irrelevant code
Paper writing (10% of total points)
• 0-10 points: writing is poor, illogical, or incoherent
• 10-30 points: writing is mostly logical and coherent
• 30-40 points: writing is crystal clear, logical, and coherent
Note: The TAs will give a couple of examples in your discussion section of what we mean by “giving justification for choices of analyses”
How to carry out this project?
• Data can be found on Canvas
• Download the data and save it in your working directory
• To load the data into R, use the code:
data_tr <- read.table("data_tr.txt", header = TRUE, sep = "\t", dec = ".")[,-1]
• Inspecting your data and preliminary analyses
  • Dependent variable (Y): tw: total wealth (in US $)
  • Predictors (X): your choice (but please make sensible choices)
  • Some suggestions: use scatter plots and/or simple linear regressions with OLS to visualize basic relationships between total wealth and various predictors
• In-depth analyses
  • What could be the X variables in your prediction exercise?
  • What methods should you use? (OLS, Ridge, Stepwise selection, Lasso)
  • How do you select the best prediction/fitted model? (K-fold cross validation, Leave-one-out)
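A minimal sketch of the preliminary inspection step, in base R. It runs on a small synthetic data frame called toy (a hypothetical stand-in with made-up values, not the SIPP sample) so that it is self-contained; with the real data you would use data_tr in its place.

```r
set.seed(2)

# Synthetic stand-in data (hypothetical values, not the SIPP sample).
n <- 300
toy <- data.frame(
  inc = runif(n, 1e4, 1e5),
  age = sample(25:64, n, replace = TRUE)
)
toy$tw <- -2e4 + 1.5 * toy$inc + 500 * toy$age + rnorm(n, sd = 2e4)

# Summary statistics for every column.
summary(toy)

# Scatter plot of the outcome against one predictor.
plot(toy$inc, toy$tw, xlab = "inc (US $)", ylab = "tw (US $)")

# Simple linear regression of tw on inc, with the fitted line added to the plot.
fit_inc <- lm(tw ~ inc, data = toy)
abline(fit_inc)
summary(fit_inc)$coefficients
```

Repeating the plot for each candidate predictor is one way to spot nonlinear patterns before choosing transformations.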
What could be the X variables in your prediction exercise?
● The plain predictors listed on Slide 3
  ○ Watch out for perfect collinearity: you do not want to include predictors that are perfectly collinear.
    ■ For example, you don’t want to include hmort (home mortgage), hval (home value), and hequity (home value minus home mortgage) all at the same time, because hequity = hval - hmort. One solution is to drop hequity from your models.
    ■ As another example of perfect collinearity, say you include the intercept term (a column of 1s) and all four dummy variables nohs, hs, smcol, col (no high school, high school, some college, college). Then nohs + hs + smcol + col equals a column of 1s (the intercept). One solution is to drop one of the education dummies from your models.
● Transformations of the plain predictors listed on Slide 3: use what you have learned from Topic 6: Flexible Linear Models
○ Polynomial transformation
○ The spline basis representation
○ Transformation using binary indicators
○ Generalized additive models (GAM)
○ Interacting dummy variables with other variables; for example, age x twoearn
● Before transforming the plain predictors, scatter plots may help you to visualize how each predictor is associated with the total wealth. For example, you may see a nonlinear relationship so you might want to consider some type of polynomial transformation or the spline basis representation
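The collinearity warning and several of the transformations listed on this slide can be sketched in base R as follows. The data frame toy and its coefficients are hypothetical, synthetic stand-ins (not the SIPP sample); the age bands and the spline degrees of freedom are arbitrary illustrative choices. GAMs are not shown.

```r
library(splines)  # ships with R; provides bs()

set.seed(178)
n <- 400
# Synthetic stand-in data (hypothetical values, not the SIPP sample).
toy <- data.frame(
  age     = sample(25:64, n, replace = TRUE),
  twoearn = rbinom(n, 1, 0.4),
  hval    = runif(n, 5e4, 3e5),
  hmort   = runif(n, 0, 5e4)
)
toy$hequity <- toy$hval - toy$hmort
toy$tw <- 50 * (toy$age - 25)^2 + 0.8 * toy$hequity +
  300 * toy$age * toy$twoearn + rnorm(n, sd = 1e4)

# Perfect collinearity: hequity = hval - hmort, so lm() cannot estimate all
# three coefficients and reports NA for the last one (hequity).
fit_collinear <- lm(tw ~ hmort + hval + hequity, data = toy)
coef(fit_collinear)

# Dropping hequity removes the redundancy.
fit_linear <- lm(tw ~ age + twoearn + hmort + hval, data = toy)

# Polynomial transformation of age (degree 2).
fit_poly <- lm(tw ~ poly(age, 2) + twoearn + hmort + hval, data = toy)

# Cubic B-spline basis in age with 5 degrees of freedom.
fit_spline <- lm(tw ~ bs(age, df = 5) + twoearn + hmort + hval, data = toy)

# Binary indicators for age bands (a step function in age).
fit_bins <- lm(tw ~ cut(age, breaks = c(24, 34, 44, 54, 64)) +
                 twoearn + hmort + hval, data = toy)

# Interaction of a dummy with another variable: age x twoearn.
fit_inter <- lm(tw ~ poly(age, 2) + twoearn + age:twoearn + hmort + hval,
                data = toy)

# In-sample R^2, for orientation only.
fits <- list(linear = fit_linear, poly = fit_poly, spline = fit_spline,
             bins = fit_bins, inter = fit_inter)
sapply(fits, function(f) summary(f)$r.squared)
```

In-sample R^2 never decreases when terms are added to a model, so it is not a basis for choosing among these specifications; out-of-sample error from cross validation is.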
Collection of methods
We have already seen:
• OLS
• Ridge regressions
• Stepwise selection methods
• Lasso
Note:
1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability
2. For Ridge, Stepwise selection, and Lasso, don’t forget to use Cross-Validation
3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense
Compare the prediction performances of different methods (an example)
• Partition the ENTIRE data into a training set and a test set
• Say you have applied the Ridge regression procedure and the Lasso procedure
  • For Ridge, you use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\lambda_{RR}^*$).
  • For Lasso, you also use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\lambda_{L}^*$).
  • $\lambda_{RR}^*$ doesn’t necessarily equal $\lambda_{L}^*$
• Which method do you choose? Ridge or Lasso?
  • You use Ridge with $\lambda_{RR}^*$ and Lasso with $\lambda_{L}^*$, respectively, to predict the outcomes with the predictors in the test set, and compute the $MSE_{te}$ (also called MSPE)
  • If $MSE_{te}(\lambda_{RR}^*)$ is substantially larger than $MSE_{te}(\lambda_{L}^*)$, choose Lasso; if it is substantially smaller, choose Ridge
  • If $MSE_{te}(\lambda_{RR}^*)$ and $MSE_{te}(\lambda_{L}^*)$ are similar, choose the one whose fitted model you find easier to understand (e.g., one with fewer predictors, where the predictors are intuitive)
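The comparison above might look like the following sketch. It assumes the glmnet package is installed (the source does not name a package; glmnet is one common choice) and uses synthetic data with hypothetical predictor names x1 to x10 in place of the SIPP sample. The 80/20 split and 5 folds are illustrative choices.

```r
library(glmnet)  # assumption: glmnet is installed; the source names no package

set.seed(4)
n <- 500
p <- 10
# Synthetic stand-in data (hypothetical; not the SIPP sample).
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(3 * X[, 1] - 2 * X[, 2] + rnorm(n))

# Partition the entire data into a training set (80%) and a test set (20%).
test_id <- sample(n, size = n / 5)
X_tr <- X[-test_id, ]
y_tr <- y[-test_id]
X_te <- X[test_id, ]
y_te <- y[test_id]

# Choose lambda by 5-fold CV on the training set only.
cv_ridge <- cv.glmnet(X_tr, y_tr, alpha = 0, nfolds = 5)  # alpha = 0: Ridge
cv_lasso <- cv.glmnet(X_tr, y_tr, alpha = 1, nfolds = 5)  # alpha = 1: Lasso

# Test-set MSE at each method's CV-selected lambda.
mse_te <- function(cv_fit) {
  pred <- predict(cv_fit, newx = X_te, s = "lambda.min")
  mean((y_te - pred)^2)
}
mse_ridge <- mse_te(cv_ridge)
mse_lasso <- mse_te(cv_lasso)
c(ridge = mse_ridge, lasso = mse_lasso)

# Number of predictors Lasso keeps (nonzero coefficients, intercept excluded),
# which is relevant when the two test errors are similar.
n_kept <- sum(as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1] != 0)
n_kept
```

The test set is touched only once, after both values of lambda have been fixed, so the two test errors are comparable estimates of out-of-sample performance.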
K-fold cross validation
1. Partition the training data $T$ into $K$ separate sets of equal size
  • $T = (T_1, T_2, \ldots, T_K)$; e.g., $K = 5$ or $10$
2. For a given $\lambda$ and each $m = 1, 2, \ldots, K$, estimate the model with all data excluding $T_m$
  • Denote the obtained model by $\hat{f}_{-m,\lambda}(\cdot)$
3. Predict the outcomes for $T_m$ with the model from Step 2 and the input data in $T_m$
  • The predicted outcomes are $\hat{f}_{-m,\lambda}(x)$ where $x \in T_m$
4. Compute the sample mean squared (prediction) error for $T_m$, known as the CV prediction error:
  • $CVerr_{-m}(\lambda) = |T_m|^{-1} \sum_{(x,y) \in T_m} \left( y - \hat{f}_{-m,\lambda}(x) \right)^2$
5. Compute the average of the sample mean squared errors over all $K$ sets for each $\lambda$
  • $avgCVerr(\lambda) = K^{-1} \sum_{m=1}^{K} CVerr_{-m}(\lambda)$
6. Select $\lambda = \lambda^*$ that gives the smallest $avgCVerr(\lambda)$
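The six steps can be written out by hand in base R, as in the sketch below. It uses a closed-form ridge estimator on centered data as the model being tuned; the helper fit_ridge, the grid of candidate values in lambdas, and the synthetic data are all hypothetical illustrations, and this ridge parameterization is not scaled the same way as in packaged implementations.

```r
set.seed(178)

# Synthetic stand-in data (hypothetical; not the SIPP sample).
n <- 200
X <- cbind(inc = rnorm(n), age = rnorm(n), educ = rnorm(n))
y <- drop(2 * X[, "inc"] + 0.5 * X[, "age"] + rnorm(n))

# Hypothetical helper: closed-form ridge fit on centered data.
# Returns a function that predicts outcomes for new inputs.
fit_ridge <- function(X, y, lambda) {
  x_bar <- colMeans(X)
  y_bar <- mean(y)
  Xc <- sweep(X, 2, x_bar)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
                crossprod(Xc, y - y_bar))
  function(X_new) drop(y_bar + sweep(X_new, 2, x_bar) %*% beta)
}

K <- 5
lambdas <- c(0, 0.1, 1, 10, 100)  # illustrative grid of candidate values

# Step 1: assign each observation to one of K folds of equal size.
fold <- sample(rep(seq_len(K), length.out = n))

# Steps 2-4: for each lambda and each fold m, fit without T_m,
# predict on T_m, and record the mean squared prediction error.
cv_err <- sapply(lambdas, function(lambda) {
  sapply(seq_len(K), function(m) {
    held_out <- fold == m
    f_hat <- fit_ridge(X[!held_out, , drop = FALSE], y[!held_out], lambda)
    mean((y[held_out] - f_hat(X[held_out, , drop = FALSE]))^2)
  })
})  # K x length(lambdas) matrix: rows are folds, columns are lambdas

# Step 5: average the CV errors over the K folds for each lambda.
avg_cv_err <- colMeans(cv_err)

# Step 6: select the lambda with the smallest average CV error.
lambda_star <- lambdas[which.min(avg_cv_err)]
lambda_star
```

The same fold assignment is reused for every candidate value, so differences in the average CV error reflect the choice of lambda rather than a different random split.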
You can use the code from the discussion sections…
Lastly,
• Please do not leave the project to the last minute. Start early
• Both TAs and I will be happy to answer your questions about the project
Copyright
• My pre-recorded video lectures are protected by U.S. copyright law and by University policy. I am the exclusive owner of the copyright in those materials I create.
• You may take notes and make copies of course materials for your own use. You may also share those materials with another student who is enrolled in or auditing this course.
• You may not reproduce, distribute or display (post/upload) lecture notes or recordings or course materials in any other way — whether or not a fee is charged — without my express prior written consent. You also may not allow others to do so. If you do so, you may be subject to student conduct proceedings under the UC San Diego Student Code of Conduct.
• Similarly, you own the copyright in your original papers. If I am interested in posting your answers or papers on the course web site, I will ask for your written permission.