SAS PROGRAMMING PROBLEM

National Adult Tobacco Survey (NATS)

The National Adult Tobacco Survey (NATS) was created to assess the prevalence of tobacco use, as well as the factors promoting and impeding tobacco use among adults. NATS also establishes a comprehensive framework for evaluating both national and state-specific tobacco control programs.

NATS was designed as a stratified, national landline and cell phone survey of non-institutionalized adults aged 18 years and older residing in the 50 states or D.C. It was developed to yield data that are representative and comparable at both national and state levels. The sample design also aims to provide national estimates for subgroups defined by gender, age, and race/ethnicity.

The original and historical survey data are available from the Centers for Disease Control and Prevention (CDC): NATS. The data and codebook are available at

http://bigblue.depaul.edu/jlee141/econdata/health_data/NATS/

Here is the code to start with:

filename webdat url "http://bigblue.depaul.edu/jlee141/econdata/health_data/NATS/nats_2013_finalcpl_6_weighted.csv" ;

proc import datafile=webdat out=NATS dbms=csv replace ; run ;

/* The STRATA statement below requires data sorted by the strata variables */
proc sort data=NATS ; by SMOKNOW SMOK100 ; run ;

/* Split the dataset into training (80%) and testing (20%) */
/* Replace YourDePaulID with your own numeric ID */
proc surveyselect data=NATS method=srs seed=YourDePaulID outall
                  samprate=0.8 out=INDATA ;
   strata SMOKNOW SMOK100 ;
run ;

data TRAIN ; set INDATA ;
   if selected = 1 ;
run ;

data TEST ; set INDATA ;
   if selected = 0 ;
   SMOKSOMEDAY = . ;   /* the outcome is blanked out in TEST */
run ;

/* Use TRAIN dataset to answer all questions except where noted otherwise */

proc contents data=TRAIN ; run ;

Keep in mind that all negative values mean N.A. or missing, so you should consider converting them to missing values. Here are some examples:

This means age is missing whenever AGE is a negative number. For SMOK100, any case with a negative number should likewise be defined as missing. Here is another example, on SMOKPERDAY:

In this case, all negative numbers are missing, and less than 1 cigarette a day is coded as 666. I don't believe anyone can smoke 666 cigarettes per day, so this code cannot be confused with a real count. I would suggest recoding it as 0 rather than as missing.

As these cases show, you have to take special care with the data before you run any analysis. The easiest way to deal with missing cases is to recode the variable, or create a new one, so that they are defined as missing. Here is an example:

data NEWDATA ; set OLDDATA ;
   if AGE < 0 then AGE = . ;
   if SMOKPERDAY < 0 then SMOKPERDAY = . ;
   /* In this case you do not delete the observations but keep them as zero.
      Whether 666 becomes missing or zero is your choice. */
   if SMOKPERDAY = 666 then SMOKPERDAY = 0 ;
run ;
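Since every negative value stands for N.A. or missing, the same recode can be applied to all numeric variables at once. The following is only a sketch under assumptions: it assumes PROC IMPORT read AGE, SMOK100, and SMOKPERDAY as numeric, and that no numeric column in the file uses negative numbers as real data (check the codebook before relying on it). TRAIN_CLEAN is a hypothetical dataset name.

```sas
/* Sketch only: recode every negative numeric value in TRAIN to missing.
   Assumes no numeric column uses negative numbers as real data. */
data TRAIN_CLEAN ;                 /* TRAIN_CLEAN is a hypothetical name */
   set TRAIN ;
   array nums {*} _numeric_ ;      /* all numeric variables read from TRAIN */
   do i = 1 to dim(nums) ;
      if not missing(nums{i}) and nums{i} < 0 then nums{i} = . ;
   end ;
   /* 666 means "less than 1 cigarette a day"; kept as zero here */
   if SMOKPERDAY = 666 then SMOKPERDAY = 0 ;
   drop i ;
run ;

/* Check the result: the minimum should no longer be negative */
proc means data=TRAIN_CLEAN n nmiss min max ;
   var AGE SMOK100 SMOKPERDAY ;
run ;
```

The NMISS column shows how many observations each variable loses, which is worth knowing before any graph or model is built on it.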

Use TRAIN data to answer the following questions

1. Descriptive Analytics Questions. Make graphs and answer the questions. (1 point each)

1) The percentage of current smokers by race, gender, age group, education level, and geography.

2) Among the current smokers, who uses e-cigarettes? Compare the demand for e-cigarettes by race, gender, age group, and education level.

3) Among the current smokers, who uses smokeless tobacco products? Compare the demand for smokeless tobacco products by race, gender, age group, and education level.

4) The mean and median cost per pack by state. Identify the three highest-cost states and the lowest-cost states.

5) The ranking of the most popular brands among smokers, including current and past smokers.

6) Which groups are the most conscious of the health dangers of smoking, by race, gender, age group, and education level?

7) Who quit smoking or intends to quit smoking? Compare by race, gender, age group, and education level.

8) Suppose you are working as a consultant for Philip Morris, promoting the brand name “Marlboro”. Using descriptive analytics, find the best target group in terms of race, gender, and age group. Fully justify your answers using graphs and descriptive statistics.
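One possible pattern for the "percentage by group" questions above is sketched below for a single grouping variable. Everything about the coding is an assumption to verify against the codebook: GENDER is a hypothetical variable name, SMOK100 is assumed to be 1 = yes and 2 = no, and SMOKNOW is assumed to be 1 = every day, 2 = some days, 3 = not at all. The percentages are unweighted.

```sas
/* Sketch only. GENDER is a hypothetical variable name, and the coding of
   SMOK100 and SMOKNOW is assumed; check the codebook first. */
data SMOKERS ;
   set TRAIN ;
   if SMOK100 < 0 then SMOK100 = . ;
   if SMOKNOW < 0 then SMOKNOW = . ;
   if GENDER  < 0 then GENDER  = . ;
   /* Assumed coding: SMOK100 1 = yes, 2 = no;
      SMOKNOW 1 = every day, 2 = some days, 3 = not at all */
   if SMOK100 = 2 then CURSMOKE = 0 ;
   else if SMOK100 = 1 and SMOKNOW in (1, 2) then CURSMOKE = 1 ;
   else if SMOK100 = 1 and SMOKNOW = 3 then CURSMOKE = 0 ;
   else CURSMOKE = . ;
   CURSMOKE_PCT = 100 * CURSMOKE ;   /* the mean of this is a percentage */
run ;

/* Table: percentage of current smokers in each group */
proc means data=SMOKERS n mean maxdec=1 ;
   class GENDER ;
   var CURSMOKE_PCT ;
run ;

/* Graph: the same percentages as a bar chart */
proc sgplot data=SMOKERS ;
   vbar GENDER / response=CURSMOKE_PCT stat=mean datalabel ;
   yaxis label="Current smokers (%)" ;
run ;
```

Swapping GENDER for the race, age group, education, or state variable gives the other breakdowns, and adding MEDIAN to the PROC MEANS statistics covers questions that ask for both mean and median.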

Use TRAIN data to answer the following questions

2. Hierarchical and Non-hierarchical Clustering Analysis (1 point each)

1) Demographic clusters: Clusters by age, income group, and education level.

2) Test whether the clusters differ significantly in the number of cigarettes smoked per day and in whether respondents currently smoke.

3) Based on the clusters, which is the most important group for tobacco companies to target when promoting new cigarette products?
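The clustering steps above could be sketched roughly as follows. This is an illustration, not a required approach: INCOME_GRP and EDUC_LVL are hypothetical variable names (use the names in the codebook), the choice of four clusters is arbitrary, and the 5% subsample for the hierarchical step is only there because PROC CLUSTER can be slow on a dataset of this size.

```sas
/* Sketch only. INCOME_GRP and EDUC_LVL are hypothetical names; AGE and
   SMOKPERDAY appear in the text above. Four clusters is arbitrary. */
data CLUS_IN ;
   set TRAIN ;
   if AGE < 0 then AGE = . ;
   if INCOME_GRP < 0 then INCOME_GRP = . ;
   if EDUC_LVL < 0 then EDUC_LVL = . ;
   if SMOKPERDAY < 0 then SMOKPERDAY = . ;
   if SMOKPERDAY = 666 then SMOKPERDAY = 0 ;
   if nmiss(AGE, INCOME_GRP, EDUC_LVL) = 0 ;   /* complete cases only */
run ;

/* Put the three variables on a common scale before clustering */
proc stdize data=CLUS_IN out=CLUS_STD method=std ;
   var AGE INCOME_GRP EDUC_LVL ;
run ;

/* Non-hierarchical: k-means; KM_OUT gains a CLUSTER variable */
proc fastclus data=CLUS_STD maxclusters=4 maxiter=100 out=KM_OUT ;
   var AGE INCOME_GRP EDUC_LVL ;
run ;

/* Hierarchical: Ward's method on a roughly 5% random subsample */
data CLUS_SUB ;
   set CLUS_STD ;
   if ranuni(12345) <= 0.05 ;
run ;

proc cluster data=CLUS_SUB method=ward outtree=TREE noprint ;
   var AGE INCOME_GRP EDUC_LVL ;
run ;

proc tree data=TREE nclusters=4 out=HC_OUT noprint ;
run ;

/* Do the k-means clusters differ in cigarettes smoked per day? */
proc glm data=KM_OUT ;
   class CLUSTER ;
   model SMOKPERDAY = CLUSTER ;
   means CLUSTER ;
run ; quit ;
```

For the yes/no outcome (currently smoke or not), a chi-square test from PROC FREQ with the CHISQ option on a cluster-by-smoker table would play the role that the F test from PROC GLM plays here.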

Use only TRAIN data to estimate the models, and use the TEST data to perform the out-of-sample prediction and answer all questions.

3. Regression Analysis Questions I (6 points)

1) Estimate several regression models that explain the number of cigarettes smoked per day in 30 days (SMOKSOMDAY) using related variables.

Model 1: Your own choices of variables

Model 2: Your own choices of variables + square of age + any nonlinear variables

Model 3: Stepwise

Model 4: Adjusted R-square selection

2) Perform the out-of-sample prediction using the observations that were not used in the estimation. Compute the following statistics and compare the results. Which model performs best on the out-of-sample prediction in terms of these statistics?

a. MSE (mean square error)

b. RMSE (root mean square error)

c. MPE (mean percentage error)

d. MAE (mean absolute error)
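One way to get the four statistics above is sketched here for a single model. It relies on the fact that PROC REG leaves observations with a missing dependent variable out of the estimation but still writes predicted values for them. The assumptions: the dependent variable is spelled SMOKSOMEDAY as in the starter code, AGE and its square stand in for whatever predictors you choose, the dataset and variable names REGDATA, PRED, ERR, Y_ACTUAL, and YHAT are hypothetical, and the percentage error is taken as (actual - predicted) / actual.

```sas
/* Sketch only. AGE and AGE2 stand in for your own predictors; REGDATA,
   PRED, ERR, Y_ACTUAL and YHAT are hypothetical names. */
data REGDATA ;
   set INDATA ;
   if SMOKSOMEDAY < 0 then SMOKSOMEDAY = . ;
   if AGE < 0 then AGE = . ;
   AGE2 = AGE * AGE ;
   Y_ACTUAL = SMOKSOMEDAY ;                 /* keep the true value for scoring */
   if selected = 0 then SMOKSOMEDAY = . ;   /* hide test outcomes from the fit */
run ;

/* Rows with a missing dependent variable are not used in estimation,
   but they still receive a predicted value in PRED.
   For stepwise selection, add / selection=stepwise to the MODEL statement. */
proc reg data=REGDATA ;
   model SMOKSOMEDAY = AGE AGE2 ;
   output out=PRED p=YHAT ;
run ; quit ;

/* Errors on the test rows only */
data ERR ;
   set PRED ;
   where selected = 0 and Y_ACTUAL is not missing and YHAT is not missing ;
   E  = Y_ACTUAL - YHAT ;
   SE = E * E ;
   AE = abs(E) ;
   /* percentage error is undefined when the actual value is zero */
   if Y_ACTUAL ne 0 then PE = 100 * E / Y_ACTUAL ;
run ;

proc sql ;
   select mean(SE)       as MSE,
          sqrt(mean(SE)) as RMSE,
          mean(PE)       as MPE,
          mean(AE)       as MAE
   from ERR ;
quit ;
```

Repeating the PROC REG and scoring steps for each model, with a different output dataset each time, gives one row of statistics per model to compare.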

ALL SAS CODE NEEDS TO BE SAVED AS A txt file

ALL EXPLANATIONS WITH GRAPHS IN A pdf or docx file