RStudio Stats Homework


Homework 2

Due: before 12:00 pm (noon) on Tuesday, March 30. Please do not include your name on your write-up, since these documents will be reviewed by anonymous peer graders.

For probability derivations, show your work and/or explain your reasoning. Do not include your raw R code in your write-up unless we explicitly ask for it. You will submit your R script as a document separate from the write-up. On Canvas, you will see two assignment pages corresponding to Homework 2: (1) one to upload your write-up PDF file and (2) one to upload the R script that you used to generate your write-up. Your write-up is what will be peer graded. The R script will not be graded, but you must submit it to receive credit on the write-up.

If you use tables or figures, make sure they are formatted professionally. Figures and tables should have informative captions. Numbers should be rounded to a sensible number of digits (you’re at UT and therefore a smart cookie; use your judgment for what’s sensible depending on the level of precision that is appropriate for the problem context).

Problem 1 - NHANES

The American National Health and Nutrition Examination Surveys (NHANES) are collected by the US National Center for Health Statistics, which has conducted a series of health and nutrition surveys since the early 1960s. Since 1999, approximately 5,000 individuals of all ages have been interviewed each year. For this problem you will need to install the NHANES package in RStudio; it provides a built-in data frame called NHANES.

library(NHANES)
library(mosaic)
data(NHANES)

Part A: Create a histogram for the distribution of SleepHrsNight for individuals aged 18-22 (inclusive) via the bootstrap. Use at least 10000 iterations. Include the plot and report the mean sleep hours for this age group. Optional: how does your sleep compare?
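As a rough illustration of the bootstrap idea in Part A, here is a minimal base-R sketch. It runs on a synthetic vector (sleep_hours, a made-up stand-in, not NHANES data), and the names sleep_hours, n_boot, and boot_means are hypothetical. With the real data you would first restrict to ages 18-22 and drop any missing SleepHrsNight values.

```r
# Synthetic stand-in for SleepHrsNight among 18-22 year olds (NOT real NHANES data).
set.seed(1)
sleep_hours <- sample(4:10, size = 300, replace = TRUE)

# Bootstrap: resample the observed values with replacement and
# record the mean of each resample.
n_boot <- 10000
boot_means <- replicate(n_boot, {
  resampled <- sample(sleep_hours, size = length(sleep_hours), replace = TRUE)
  mean(resampled)
})

# Histogram of the bootstrap sampling distribution of the mean.
hist(boot_means,
     main = "Bootstrap distribution of mean sleep hours (synthetic data)",
     xlab = "Mean hours of sleep per night")

# Point estimate from the original (here: synthetic) sample.
sample_mean <- mean(sleep_hours)
```

The histogram shows how much the sample mean would vary from resample to resample; its center should sit close to sample_mean.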

Part B: Now we want to build a confidence interval for the proportion of women we think are pregnant at any given time. Bootstrap a confidence interval with 10000 iterations. Include in your write-up a histogram of your simulation results, along with a 95% confidence interval for the proportion. To speed things up, you can use this code to subset the NHANES data frame to women only; it also drops the rows with NA values for our variable of interest (PregnantNow):

NHANES_women <- NHANES %>%
  filter(Gender == "female", !is.na(PregnantNow))

Problem 2 - Iron Bank

The Securities and Exchange Commission (SEC) is investigating the Iron Bank, where a cluster of employees have recently been identified in various suspicious patterns of securities trading. Of the last 2021 trades, 70 were flagged by the SEC’s detection algorithm. Trades are flagged periodically even when no illicit market activity has taken place. For that reason, the SEC often monitors individual and institutional trading but does not investigate detected incidents that may be consistent with random variability in trading patterns.

SEC data suggest that the overall baseline rate of suspicious securities trades is 2.4%.

Are the observed data (70 flagged trades out of 2021) consistent with the SEC’s null hypothesis that, over the long run, securities trades from the Iron Bank are flagged at the same baseline rate as that of other traders?

Use Monte Carlo simulation (with at least 100000 simulations) to calculate a p-value under this null hypothesis. Include the following items in your write-up:


• the null hypothesis that you are testing;

• the test statistic you used to measure evidence against the null hypothesis;

• a plot of the probability distribution of the test statistic, assuming that the null hypothesis is true;

• the p-value itself;

• and a one-sentence conclusion about the extent to which you think the null hypothesis looks plausible in light of the data. This one is open to interpretation! Make sure to defend your conclusion.
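The simulation described above can be sketched as follows. This is one possible setup, not the required one: it assumes the test statistic is the number of flagged trades, that each trade is flagged independently at the baseline rate under the null, and that the p-value is one-sided (results at least as extreme as 70). The variable names are hypothetical.

```r
set.seed(1)
n_trades       <- 2021    # trades examined (from the problem statement)
observed_flags <- 70      # flagged trades actually observed
baseline_rate  <- 0.024   # SEC baseline rate of flagged trades
n_sims         <- 100000  # number of Monte Carlo simulations

# Under the null, each simulated value is the number of flagged trades
# out of 2021 when every trade is flagged with probability 2.4%.
sim_flags <- rbinom(n_sims, size = n_trades, prob = baseline_rate)

# One-sided p-value: share of simulations with at least as many
# flagged trades as were observed.
p_value <- mean(sim_flags >= observed_flags)

# Distribution of the test statistic under the null, with the observed value marked.
hist(sim_flags, breaks = 50,
     main = "Flagged trades out of 2021 under the null hypothesis",
     xlab = "Number of flagged trades")
abline(v = observed_flags, lty = 2)
```

Whether a one-sided or two-sided p-value is more appropriate is a judgment call that your write-up should state and defend.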

Problem 3 - Armfold

A professor at an Australian university ran the following experiment with her students in a data science class. Everyone in the class stood up, and the professor asked everyone to fold their arms across their chest. Students then filled out an online survey with two pieces of information: 1) Did they fold their arms with the left arm on top of right, or with the right arm on top of the left? 2) Did they identify as male or female? The professor then asked her students to assess whether, in light of the data from the survey, there was support for the idea that males and females differed in how often they folded their arms with their left arm on top of the right. The survey data indicated that males folded their arms with their left arms on top more frequently. But how much more frequently? And was this just a “small-sample” difference? Or did it accurately reflect a population-level trend?

The data from this experiment are in armfold.csv. There are two relevant variables:

• LonR_fold: a binary (0/1) indicator, where 1 indicates left arm on top, and 0 indicates right arm on top.

• Sex: a categorical variable with levels male and female.

(There’s also a third variable indicating which hand the student writes with, but we’re not using that here.)

Your task (quite similar to what we did with the recidivism R walkthrough) is to assess support for any male/female differences in the population-wide rate of “left arm on top” folding. Make sure to quantify your uncertainty about how much more often males fold their left arms on top. (That is, it’s not enough to just report the estimate for this sample; you have to provide a confidence interval that tells us how we can expect this number to generalize to the wider population. In doing so, you can treat this sample as if it were a random sample from the relevant population, in this case university students.) Your write-up should include four sections:

1) Question: What question are you trying to answer?

2) Approach: What modeling approach did you use to answer the question?

3) Results: What evidence/results did your modeling approach provide to answer the question? This might include numbers, figures, and/or tables as appropriate depending on your approach.

4) Conclusion: What is your conclusion about your question? You will want to provide a short written interpretation of your confidence interval.

Note: for a relatively simple problem like this, each of these four sections will likely be quite short. Nonetheless, these sections reflect a good general organization for a data-science write-up. So we’ll start practicing with this organization on a simple problem, even if it seems a bit overkill at first. (It is certainly possible in this case for each of them to be only 1 or 2 sentences long. Although you might feel you need more, and although nobody on our end is breaking out a word counter, it shouldn’t be too much longer than that.)
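One way to quantify uncertainty about a difference in proportions between two groups is a bootstrap percentile interval. The sketch below uses a synthetic data frame with the same column names as armfold.csv (LonR_fold and Sex); the values are made up, and the helper diff_prop is hypothetical. It is a sketch under those assumptions, not the course's reference solution.

```r
set.seed(1)
# Synthetic stand-in for armfold.csv (NOT the real survey data).
armfold <- data.frame(
  Sex       = rep(c("male", "female"), times = c(100, 110)),
  LonR_fold = c(rbinom(100, 1, 0.50), rbinom(110, 1, 0.45))
)

# Difference in the proportion folding left-on-top: male minus female.
diff_prop <- function(df) {
  mean(df$LonR_fold[df$Sex == "male"]) - mean(df$LonR_fold[df$Sex == "female"])
}

observed_diff <- diff_prop(armfold)

# Bootstrap: resample whole rows with replacement, recompute the difference.
boot_diffs <- replicate(10000, {
  rows <- sample(nrow(armfold), replace = TRUE)
  diff_prop(armfold[rows, ])
})

# 95% percentile confidence interval for the difference.
ci_95 <- quantile(boot_diffs, probs = c(0.025, 0.975))
```

If the interval comfortably includes zero, the sample is compatible with no population-level difference; if it excludes zero, the sign and width of the interval describe the plausible size of the difference.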

Problem 4 - Ebay

In this problem, you’ll analyze data from an experiment run by EBay in order to assess whether the company’s paid advertising on Google’s search platform was improving EBay’s revenue. (It was certainly improving Google’s revenue!)

Google Ads, also known as Google AdWords, is Google’s advertising search system, and it’s the primary way the company made its $162 billion in revenue in fiscal year 2019. The AdWords system has advertisers bid on certain keywords (e.g., “iPhone” or “toddler shoes”) in order for their clickable ads to appear at the top of the page in Google’s search results. These links are marked as an “Ad” by Google, and they’re distinct from the so-called “organic” search results that appear lower down the page.

Nobody pays for the organic search results; pages get featured here if Google’s algorithms determine that they’re among the most relevant pages for a given search query. But if a customer clicks on one of the sponsored “Ad” search results, Google makes money. Suppose, for example, that EBay bids $0.10 on the term “vintage dining table” and wins the bid for that term. If a Google user searches for “vintage dining table” and ends up clicking on the EBay link from the page of search results, EBay pays Google $0.10 (the amount of their bid). [1]

For a small company, there’s often little choice but to bid on relevant Google search terms; otherwise their search results would be buried. But a big site like EBay doesn’t necessarily have to pay in order for their search results to show up prominently on Google. They always have the option of “going organic,” i.e. not bidding on any search terms and hoping that their links nonetheless are shown high enough up in the organic search results to garner a lot of clicks from Google users. So the question for a business like EBay is, roughly, the following: does the extra traffic brought to our site from paid search results, above and beyond what we’d see if we “went organic,” justify the cost of the ads themselves?

To try to answer this question, EBay ran an experiment in May of 2013. For one month, they turned off paid search in a random subset of 70 of the 210 designated market areas (DMAs) in the United States. A designated market area, according to Wikipedia, is “a region where the population can receive the same or similar television and radio station offerings, and may also include other types of media including newspapers and Internet content.” Google allows advertisers to bid on search terms at the DMA level, and it infers the DMA of a visitor on the basis of that visitor’s browser cookies and IP address. Examples of DMAs include “New York,” “Miami-Ft. Lauderdale,” and “Beaumont-Port Arthur.”

In the experiment, EBay randomly assigned each of the 210 DMAs to one of two groups:

• the treatment group, where advertising on Google AdWords for the whole DMA was paused for a month, starting on May 22.

• the control group, where advertising on Google AdWords continued as before.

In ebay.csv you have the results of the experiment. The columns in this data set are:

• DMA: the name of the designated market area, e.g. New York

• rank: the rank of that DMA by population

• tv_homes: the number of homes in that DMA with a television, as measured by the market research firm Nielsen (who defined the DMAs in the first place)

• adwords_pause: a 0/1 indicator, where 1 means that DMA was in the treatment group, and 0 means that DMA was in the control group.

• rev_before: EBay’s revenue in dollars from that DMA in the 30 days before May 22, before the experiment started.

• rev_after: EBay’s revenue in dollars from that DMA in the 30 days beginning on May 22, after the experiment started.

The outcome of interest is the revenue ratio at the DMA level, i.e. the ratio of revenue after to revenue before for each DMA. If EBay’s paid search advertising on Google was driving extra revenue, we would expect this revenue ratio to be systematically lower in the treatment-group DMAs versus the control-group DMAs. On the other hand, if paid search advertising were a waste of money, then we’d expect the revenue ratio to be basically equal in the control and treatment groups.

[1] There’s huge variability in the market price of different search terms. The market price per click for a search term like "insurance" or "attorney" or "MBA programs" might be $50 or more. For stuff you might buy on EBay, it’s usually a lot less.


Two explanatory notes here:

• We use the ratio rather than the absolute difference because the DMAs differ enormously in population and therefore revenue.

• We wouldn’t necessarily expect the before-and-after revenue ratio to be 1 (i.e. similar revenue before and after the experiment), even in the control-group DMAs. That’s because, as with any retailer, EBay’s sales exhibit a lot of seasonal patterns and might be lower in some months across the board, regardless of paid search. That’s why the important question isn’t whether the revenue is the same before and after in the treatment-group DMAs, but whether the before-and-after ratio is the same for the treatment group as for the control group.

Your task is to compute the average treatment effect and provide a confidence interval via bootstrapping to assess the evidence for whether the revenue ratio is the same in the treatment and control groups, or whether instead the data favor the idea that paid search advertising on Google creates extra revenue for EBay. Make sure you use at least 10000 Monte Carlo simulations in your bootstrap simulation. Your write-up should include the sections: 1) Question; 2) Approach; 3) Results; 4) Conclusion as outlined in Problem 3.
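One natural reading of "average treatment effect" here, offered as an assumption rather than as the required definition, is the difference in mean revenue ratio between the two groups. The symbols below ($r_i$, $T$, $C$, $n_T$, $n_C$) are notation introduced for illustration: $T$ is the set of DMAs with adwords_pause = 1, $C$ the set with adwords_pause = 0, and $n_T$, $n_C$ are the group sizes.

```latex
\[
r_i = \frac{\text{rev\_after}_i}{\text{rev\_before}_i},
\qquad
\widehat{\text{ATE}}
  = \frac{1}{n_T}\sum_{i \in T} r_i \;-\; \frac{1}{n_C}\sum_{i \in C} r_i
\]
```

With this sign convention, a negative estimate means the revenue ratio was lower where ads were paused, which is the pattern expected if paid search was driving extra revenue. A bootstrap confidence interval then comes from recomputing this difference on each resample of the DMAs.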

Problem 5 - Creatinine

Download the data in creatinine.csv. Each row represents one of 150 patients from a particular nephrologist’s office. The variables in this data frame are:

• age: patient’s age in years.

• creatclear: patient’s creatinine clearance rate in mL/minute, a measure of kidney health (a higher rate means better clearance, i.e., more healthy).

Use these data to answer three questions:

A) What creatinine clearance rate should we expect for a 36-year-old? Explain briefly (using one or two sentences and equations) how you determined this estimate.

B) How does creatinine clearance rate change with age? (This should be a single number whose units are mL/minute per year.) Explain briefly (one or two sentences) how you determined this.

C) Who has a creatinine clearance rate that is healthier (higher) for their age: a 45-year-old with a rate of 130, or a 60-year-old with a rate of 120? Explain briefly (using a few sentences and showing your work with equations) how you determined this.
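A simple linear regression of creatclear on age is one plausible way to approach all three questions; the problem statement does not name a method, so treat this as an assumption. The sketch below fits the model to synthetic data (the intercept, slope, and noise level used to generate it are invented), so its numbers say nothing about the real creatinine.csv.

```r
set.seed(1)
# Synthetic stand-in for creatinine.csv (NOT the real patient data).
creatinine <- data.frame(age = sample(18:85, 150, replace = TRUE))
creatinine$creatclear <- 150 - 0.6 * creatinine$age + rnorm(150, sd = 7)

# Fit a straight line: creatclear = intercept + slope * age.
fit <- lm(creatclear ~ age, data = creatinine)

# (A) Expected clearance rate for a 36-year-old.
predicted_36 <- predict(fit, newdata = data.frame(age = 36))

# (B) Change in clearance per additional year of age (mL/minute per year).
slope <- coef(fit)[["age"]]

# (C) Compare each patient to what the line predicts for their age:
# the larger residual (actual minus expected) is healthier for their age.
patients <- data.frame(age = c(45, 60), creatclear = c(130, 120))
patients$expected <- predict(fit, newdata = patients)
patients$residual <- patients$creatclear - patients$expected
```

Comparing residuals rather than raw rates is what makes the comparison in (C) age-adjusted.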

Problem 6 - Orbital Scanner

If a Resistance ship enters the atmosphere on the desolate rock planet of Exegol, the probability that the Imperial fleet’s orbital scanner will correctly register its presence is 95%. If there is, in fact, no Resistance ship on Exegol, the scanner will falsely register the presence of a ship with probability 5%. Historical data indicate that there is a 15% probability at any given time that a Resistance ship is on Exegol.

Imagine that you are among the Imperial Guard assigned to the Sith Citadel. Today, the orbital scanner registers the presence of a Resistance ship. What is the probability that a Resistance ship is, in fact, on Exegol? Prepare a short report with this calculation. Don’t forget to show your work.
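This is a direct application of Bayes' rule. In the sketch below, $S$ (a ship is present) and $R$ (the scanner registers a ship) are event names introduced for illustration; the three probabilities come from the problem statement.

```latex
\[
P(S \mid R)
  = \frac{P(R \mid S)\,P(S)}{P(R \mid S)\,P(S) + P(R \mid S^{c})\,P(S^{c})}
  = \frac{0.95 \times 0.15}{0.95 \times 0.15 + 0.05 \times 0.85}
  = \frac{0.1425}{0.185}
  \approx 0.77
\]
```

The denominator is the total probability that the scanner registers a ship, whether or not one is actually there.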

Problem 7 - Big Mac Index

From The Economist: "The Big Mac index was invented by The Economist in 1986 as a lighthearted measure of the extent to which currencies are at their ‘correct’ level. It is based on the theory of purchasing-power parity (PPP), the idea that long-run exchange rates should move towards the rate that equalises the prices of an identical basket of goods and services (e.g., a burger) in any two countries.


Burgernomics was never intended as a precise guide of currency misalignment but rather a tool to make exchange-rate theory more digestible. Yet the Big Mac index is now a global standard appearing in economic textbooks and used in academic studies."

Download bigmac.csv to access a data frame with the following variables:

• date: Date of observation

• iso_a3: Three-character ISO 3166-1 country code

• currency_code: Three-character ISO 4217 currency code

• name: Country name

• local_price: Price of a Big Mac in the local currency

• dollar_ex: Local currency units per dollar

• GDP_dollar: GDP per person, in dollars

Let’s use this dataset to construct and analyze a simpler version of the Big Mac Index.

Preprocessing

For each observation, create a new variable for the price of a Big Mac in US Dollars (USD), dividing the local price by the local currency units per dollar. Before you do that, however, it’s good practice to account for any ‘gotchas’ that could happen when dividing—make sure that the local currency units per dollar are all positive, non-zero values.

library(tidyverse)
bigmac <- read.csv("bigmac.csv")
bigmac <- bigmac %>%
  filter(dollar_ex > 0) %>%
  mutate(price_usd = local_price / dollar_ex)

For convenience, let’s also add another variable for the year of the observation date. To do this, use the function lubridate::year() on the date variable like so:

bigmac <- bigmac %>% mutate(year = lubridate::year(date))

Because each year can possibly have multiple observations per country, average the USD price across each country and year to avoid possible duplicates. You should use the dataset you make here for the subsequent problems. Any references to price refer to the average calculated in this step.

avg_bigmac <- bigmac %>%
  group_by(name, year) %>%
  summarise(avg_usd = mean(price_usd, na.rm = TRUE))

Part A: Use boxplots to visualize how the distribution of price differs by year. Remember the factor() function for treating numeric variables as categorical.

Based on your plot, what can you say about the price of a Big Mac over time? In your write-up, include an informative caption (a few sentences or a short paragraph) below the plot that identifies the variables and units plotted on the chart and also summarizes main takeaways from the plot. Be sure to comment on measures of both center and spread, as well as the presence of outliers.
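To illustrate the factor() hint in Part A, here is a small base-R sketch on a synthetic data frame (avg_bigmac_demo, a hypothetical name with invented prices). It mirrors the name, year, and avg_usd columns of the averaged dataset; base graphics are used only to keep the example self-contained, and a ggplot2 version would work equally well.

```r
set.seed(1)
# Synthetic stand-in for the country-by-year average prices (NOT real data).
avg_bigmac_demo <- data.frame(
  name    = rep(paste("Country", 1:20), times = 3),
  year    = rep(c(2019, 2020, 2021), each = 20),
  avg_usd = round(runif(60, min = 1.5, max = 7), 2)
)

# factor(year) makes the numeric year behave as a category,
# so we get one box per year instead of a single box.
boxplot(avg_usd ~ factor(year), data = avg_bigmac_demo,
        xlab = "Year",
        ylab = "Average Big Mac price (USD)",
        main = "Big Mac prices by year (synthetic data)")
```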

Part B: What country had the most expensive Big Mac (in US Dollars) and how much was it? In what year was it most expensive?

Part C: Now calculate the actual Big Mac index for 2021. This is the percentage difference of the price of a Big Mac in a given locale relative to the price of a Big Mac in the United States. I stored my US Big Mac price in a variable for easy reference later (the pull() command returns just numbers, no fancy dataframe stuff):

us_price <- avg_bigmac %>% filter(year==2021, name=="United States") %>% pull(avg_usd)

Take the top 20 values of the index and plot them, sorted ascending or descending. Useful functions here are top_n() or slice_max() (see documentation for examples of usage). You may also want to use reorder() where you can pass to your axis aesthetic the label and variable to sort on. For example, if I wanted the country as one of my plot variables, I could use reorder(country, big_mac_index) in place of just country to sort them by the values of my Big Mac index variable. If you are having trouble with slice_max() try piping your dataframe to ungroup() then piping it to slice_max().
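The index calculation and top-20 selection can be sketched in base R as follows. The data frame prices_2021 and every price in it are invented placeholders, including the US price, and base R's order() and head() stand in here for the dplyr helpers mentioned above.

```r
set.seed(1)
# Synthetic 2021 prices in USD (placeholders, NOT real Big Mac prices).
prices_2021 <- data.frame(
  name    = c("United States", paste("Country", 1:30)),
  avg_usd = c(5.00, round(runif(30, min = 1.5, max = 8), 2))
)

us_price <- prices_2021$avg_usd[prices_2021$name == "United States"]

# Index: percentage difference of each price relative to the US price.
prices_2021$big_mac_index <- 100 * (prices_2021$avg_usd - us_price) / us_price

# Keep the 20 largest index values, sorted from largest to smallest.
sorted <- prices_2021[order(prices_2021$big_mac_index, decreasing = TRUE), ]
top20  <- head(sorted, 20)

# Horizontal bar chart; rev() puts the largest value at the top.
barplot(rev(top20$big_mac_index), names.arg = rev(top20$name),
        horiz = TRUE, las = 1, cex.names = 0.7,
        xlab = "Big Mac index (% difference from US price)",
        main = "Top 20 index values (synthetic data)")
```

By construction the United States has an index of exactly 0, positive values mark prices above the US price, and negative values mark prices below it.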

Part D: Address the following questions with a few sentences (no more than a short paragraph) in your write-up:

• Which country’s currency is most overvalued relative to the United States? By how much?

• Are there more overvalued or undervalued currencies, relative to the US?

• What problems might you foresee with the index as we’ve currently calculated it?

