/* This sample SAS program uses the Mroz data to examine whether age and the
   number of children affect the wage of married working women. */

/* A 'comment' is any text in your code that is ignored by the compiler. Note
   that we use '/' followed by '*' to begin a comment in our SAS code; we use
   a '*' followed by a '/' to end the comment. If you want to run only a
   section of code, you can comment out any portion of your code that you do
   NOT want compiled. */

/* Generally speaking, the SAS programming language relies on two mechanisms:
   'data steps' that are used to read in, manipulate, and merge datasets; and
   canned 'procedures' that are used to perform specific tasks like sorting
   data, printing results, calculating means, and performing estimation.

   When used for econometrics, a typical SAS program will consist of three
   parts: one or more data steps to read in raw data and create the dataset
   that will be used for analysis; then one or more procedures that perform
   the analysis; and finally one or more data steps that convert the output
   from the procedures into a useful table. */

/* A SAS program is identified using a .sas suffix (hence the name of this
   file). When you execute a SAS program, it will create a LOG file and a
   RESULTS file. The LOG file provides step-by-step comments about the
   compiling of your program. You always want to check the LOG file for any
   comments that suggest a programming error. The RESULTS file will include
   all output that is requested (using the PRINT procedure) or automatically
   generated by a procedure (like a table of regression coefficients from the
   REG procedure). */

/* Whenever SAS reads in a dataset or creates a new dataset, it will internally
   store it as a temporary file in the WORK directory unless you deliberately
   save it as a permanent file. (Permanent files are useful when working with
   very large files that take a long time to process, but are unnecessary for
   the small datasets used in this class.)
   When the program stops executing, all of the temporary files are
   automatically deleted from WORK. The original dataset as well as the LOG
   and RESULTS files remain. */

/* In SAS, each statement ends with a semicolon. As a general practice, it is
   a good idea to include the command 'run;' after each data step and
   procedure. It marks the end of the step and tells SAS to execute the code
   submitted since the last 'run' command. */

/* --------------------------------------------------------------------------
   PART ONE: READ IN THE DATA AND CREATE NEW VARIABLES
   -------------------------------------------------------------------------- */

/* The code below uses a data step to read in the mroz.txt file and assign it
   the name 'mroz'. In SAS, there are many acceptable data formats and ways to
   read in data. Here we use a basic approach that always works: using a DATA
   step to read in a tab-delimited file. The pathway for the data file is
   identified in the INFILE statement. If you are unable to read in a data
   file, it is often due to providing the wrong pathway.

   In the INFILE statement,

   "dlm='09'x" identifies the tab delimiter.

   "dsd" tells SAS to treat consecutive delimiters as separate fields rather
       than as a single delimiter (for instances in which a field may be
       blank).

   "lrecl" is the longest record length in the data (the default was 256 in
       older SAS releases, so below it is set explicitly to 5000, which is
       large enough for any dataset used in this course).

   "firstobs" indicates the row of the first data record (and is set equal to
       2 because the first row of the dataset has the field headers). */

/* The list of variable names is provided in the INPUT statement. The default
   variable format is numeric. If a character variable is read in, the
   variable name should be followed by ":$w." where 'w' is the maximum field
   length.
   (For instance, if one field contains the two-letter abbreviation for a
   state, it would appear in the INPUT statement as "state :$2.") */

data mroz;
    infile "~/my_courses/ngoldst4/mroz.txt" dlm='09'x dsd lrecl=5000 firstobs=2;
    input inlf hours kidslt6 kidsge6 age educ wage repwage hushrs husage
          huseduc huswage faminc mtr motheduc fatheduc unem city exper nwifeinc;
run;

/* Two useful diagnostic checks can be performed using the CONTENTS procedure,
   which lists the variable names and formats in a dataset, and the MEANS
   procedure, which provides summary statistics about the numeric
   variables. */

proc contents data=mroz;
run;

proc means data=mroz;
run;

/* The DATA step below reads in the temporary "mroz" dataset and creates the
   "workingonly" dataset, which retains only the observations for women in the
   workforce and adds the "lwage" and "expersq" variables that are required
   for the analysis. */

data workingonly;
    set mroz;
    where inlf=1;
    lwage = log(wage);
    expersq = exper**2;
run;

/* We can visually check whether 'workingonly' includes the 'lwage' and
   'expersq' variables by using the PRINT procedure to print the data in the
   RESULTS file. */

proc print data=workingonly;
run;

/* --------------------------------------------------------------------------
   PART TWO: EXECUTE A REGRESSION AND OUTPUT THE RESULTS
   -------------------------------------------------------------------------- */

/* Most SAS procedures have two types of outputs: output that is generated
   automatically and output that is generated upon request using the Output
   Delivery System (ODS). For instance, the REG procedure (which executes a
   single-equation linear regression) automatically generates a table of
   parameter estimates (called "outest"); it can also generate a slew of ODS
   tables, including an ANOVA table for the regression or the estimated
   variance-covariance matrix. A complete list of ODS tables for each
   procedure is provided in the online documentation.
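*/

/* As a sketch of one way to find ODS table names without consulting the
   documentation (this step is an addition for illustration and is not
   required for the analysis): the "ods trace on" statement asks SAS to write
   the name of every ODS table that a procedure produces to the LOG file, and
   "ods trace off" stops the listing. The short regression below is only a
   placeholder that gives the trace something to report. */

ods trace on;

proc reg data=workingonly;
    model lwage = exper expersq educ;
run;
quit;

ods trace off;

/*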
   In some instances, a procedure has an ODS table that contains the same
   information as in an automatically-generated table but in a more useful
   format. For instance, "outest" provides the parameter estimates in a row,
   whereas the ODS table "ParameterEstimates" provides the parameter estimates
   and their standard errors in a column. */

/* The REG procedure executes a linear regression. The intercept is included
   by default.

   "outest = beta_hats1" stores the automatically-generated parameter
       estimates in a temporary dataset called 'beta_hats1'.

   "/ acov hccme=0" tells the REG procedure to use robust standard errors
       ("acov" is short for asymptotic covariance matrix; "hccme" is short
       for heteroskedasticity consistent covariance matrix estimation).

   "output" creates temporary datasets of basic output from the regression.
       The OUTPUT statement below tells the REG procedure to create a dataset
       called "temp" that includes all of the variables in "workingonly" as
       well as the fitted residual (called uhat) and the fitted dependent
       variable (called yhat).

   "ods output" triggers the Output Delivery System and saves three tables:
       one called ParameterEstimates (which is saved as 'beta_hats2'), one
       called ANOVA (which is saved as 'anova_table'), and one of the robust
       variance-covariance matrix called ACovEst (which is saved as
       'acov_beta_hats'). */

proc reg data=workingonly outest=beta_hats1;
    model lwage = exper expersq educ age kidslt6 kidsge6 / acov hccme=0;
    output out=temp r=uhat p=yhat;
    ods output ParameterEstimates = beta_hats2
               ANOVA = anova_table
               ACovEst = acov_beta_hats;
run;

/* We can use the PRINT procedure to print the requested dataset to the
   RESULTS tab. To print specific variables, use the "var" statement followed
   by a list of the variables.
*/

proc print data=beta_hats1;
run;

proc print data=beta_hats2;
run;

proc print data=beta_hats2;
    var Variable Estimate HCStdErr HCProbt;
run;

/* --------------------------------------------------------------------------
   PART THREE: PERFORM HYPOTHESIS TESTS OF PARAMETER RESTRICTIONS
   -------------------------------------------------------------------------- */

/* The TEST statement tells the REG procedure to perform a test of parameter
   restrictions. The parameters are identified by their regressor: for
   instance, "test age=0" specifies a test of whether the coefficient on age
   is statistically different from zero. */

/* The code below performs a non-robust F test of the joint hypothesis that
   the parameters on age, kidslt6, and kidsge6 equal zero. A table of test
   results is automatically generated in the RESULTS file when the TEST
   statement is invoked. The code also requests the ODS table of test results
   TestANOVA (which is retained as 'F_test'). */

proc reg data=workingonly;
    model lwage = exper expersq educ age kidslt6 kidsge6;
    test age=0, kidslt6=0, kidsge6=0;
    ods output TestANOVA = F_test;
run;

proc print data=F_test;
run;

/* The TEST statement will automatically use a robust Wald test when the
   robust variance-covariance matrix is used. The corresponding ODS table of
   test results is 'ACovTestANOVA' (which we name 'Wald_test'). */

proc reg data=workingonly;
    model lwage = exper expersq educ age kidslt6 kidsge6 / acov hccme=0;
    test age=0, kidslt6=0, kidsge6=0;
    ods output ACovTestANOVA = Wald_test;
run;

proc print data=Wald_test;
run;

/* The code below performs a Lagrange Multiplier test of the joint
   significance of the coefficients on age, kidslt6, and kidsge6.
*/

/* First, perform the restricted regression and retain all of the fields in
   'workingonly' as well as the fitted residuals (uhat) in the 'temp'
   dataset. */

proc reg data=workingonly;
    model lwage = exper expersq educ;
    output out=temp r=uhat;
run;

/* Second, perform the regressions of each excluded variable on the included
   variables and add the residuals to the 'temp' dataset:

   (1) regress age on exper, expersq, and educ and save the residuals as
       'r1hat'
   (2) regress kidslt6 on exper, expersq, and educ and save the residuals as
       'r2hat'
   (3) regress kidsge6 on exper, expersq, and educ and save the residuals as
       'r3hat' */

proc reg data=temp;
    model age = exper expersq educ;
    output out=temp r=r1hat;
run;

proc reg data=temp;
    model kidslt6 = exper expersq educ;
    output out=temp r=r2hat;
run;

proc reg data=temp;
    model kidsge6 = exper expersq educ;
    output out=temp r=r3hat;
run;

/* Third, use a DATA step to generate the regressors uhat*r1hat, uhat*r2hat,
   uhat*r3hat, and a field of ones. */

data temp;
    set temp;
    uhat_r1hat = uhat*r1hat;
    uhat_r2hat = uhat*r2hat;
    uhat_r3hat = uhat*r3hat;
    ones = 1;
run;

/* Fourth, regress 1 on uhat*r1hat, uhat*r2hat, and uhat*r3hat, suppressing
   the intercept using the 'noint' option. Output the ODS table ANOVA (which
   we name 'anova_table'), which provides the sums of squares. */

proc reg data=temp;
    model ones = uhat_r1hat uhat_r2hat uhat_r3hat / noint;
    ods output ANOVA = anova_table;
run;

proc print data=anova_table;
run;

/* We are interested in the value n-SSR (which is given as 'SS' on the 'Model'
   row of 'anova_table') and its p-value. When we print 'anova_table',
   however, there is no p-value for this statistic, and there is a lot of
   information we do not need. We can use a DATA step to clean up the results.
   The code below reads in 'anova_table', uses the WHERE statement to select
   the record for which the Source field equals 'Model', renames the sum of
   squares (SS) as 'ChiSq', and uses the 'probchi' function to generate the
   p-value for the test.
   The KEEP option retains the two variables of interest in 'LM_test'. */

data LM_test(keep = ChiSq ProbChiSq);
    set anova_table;
    where Source='Model';
    ChiSq = SS;
    ProbChiSq = 1 - probchi(SS,DF);
run;

proc print data=LM_test;
run;

/* --------------------------------------------------------------------------
   PART FOUR: USEFUL CODE FOR MERGING DATASETS
   -------------------------------------------------------------------------- */

/* In some instances we have variables in different datasets that we want to
   use together. For instance, we may have coefficient estimates in one table
   and variable means in another, and we want to join the datasets so that we
   can calculate an average partial effect. In order to merge two datasets,
   there must be one or more fields that are common to the datasets and that,
   for at least one of those datasets, uniquely identify an observation.

   The code below calculates the mean wage and then merges it back into the
   'workingonly' dataset. After the MEANS procedure calculates 'mean_wage' and
   outputs it to a dataset called 'temp', a DATA step adds a variable called
   'merge_flag' (which is set equal to 1). The same variable is created in the
   'workingonly' dataset and then, in a third DATA step, the MERGE statement
   is used to join the 'workingonly' and 'temp' datasets using the common
   'merge_flag' field. The resulting 'workingonly' dataset now includes a mean
   wage variable (which is the same for each observation). */

proc means data=workingonly noprint;
    var wage;
    output out=temp(keep = mean_wage) mean=mean_wage;
run;

data temp;
    set temp;
    merge_flag=1;
run;

data workingonly;
    set workingonly;
    merge_flag=1;
run;

data workingonly(drop = merge_flag);
    merge workingonly temp;
    by merge_flag;
run;

proc print data=workingonly;
run;
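
/* A possible alternative to the 'merge_flag' approach above, offered as a
   sketch rather than as part of the original program. When the second
   dataset has exactly one row (as with a single mean), a DATA step can read
   that row once, on the first iteration, and SAS carries its values forward
   to every observation of the main dataset. The names 'wage_mean',
   'workingonly_alt', and 'mean_wage_alt' are made up for this illustration
   so that nothing created above is overwritten. */

proc means data=workingonly noprint;
    var wage;
    output out=wage_mean(keep = mean_wage_alt) mean=mean_wage_alt;
run;

data workingonly_alt;
    /* read the one-row dataset only on the first iteration; variables that
       come from a SET statement are retained across iterations */
    if _n_ = 1 then set wage_mean;
    set workingonly;
run;

proc print data=workingonly_alt;
run;

/* This version needs no common key, but it is only appropriate when the
   dataset being attached has a single row. The MERGE approach above is the
   more general tool. */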