StataSE 17 Questions
MMD010 Assignment
(November 2021)
Suppose we are interested in the popularity of movies (the dependent variable, measured by an index).
Use the movie dataset (Movies_data.csv) and answer the following questions:
1. Check the distribution of Popularity. Visualise it using a histogram and a density plot. Discuss the shape
of the distribution and compare it to the normal distribution. Apply a log transformation if needed and
explain your reasoning.
2. Summarize the following variables (Popularity, Total Cast (number of actors), Total Crew (number
of people involved in movie production) and Genre Adventure (adventure movies)) in a table where
you present Number of observations, Mean, Standard deviation, Minimum and Maximum. Interpret
the results.
3. Create a correlation matrix among the above-mentioned variables (Popularity, Total Cast, Total
Crew, and Genre Adventure) and interpret the results.
4. Run a t-test to check whether there is a significant difference in popularity between adventure
and non-adventure movies. Explain the findings.
5. Visualise the relationship between Popularity and Total Cast using a scatter plot. Explain what you
observe. [hint: would you use Popularity or ln Popularity?]
6. Based on the hypothesis that involving more actors in movie production leads to higher popularity,
run a simple regression of Popularity on Total Cast. Export the table and interpret the results,
addressing the interpretation of the coefficient, the significance of the coefficient, and R-squared.
[hint: would you use Popularity or ln Popularity as the dependent variable? If ln Popularity is used as the
dependent variable, how would you interpret the coefficient?]
7. In order to get a ‘true’ effect of Total Cast on Popularity, we need to use control variables. In our
data, such a control variable can be Total Crew. Run a multiple regression by adding Total Crew as a
control variable to the simple regression in Q6. Compare the simple regression and the
multiple regression: how do the interpretation of the coefficients and the model fit change? Check whether
there is a multicollinearity issue after adding the additional variable to the regression. [hint: command
vif]. Finally, select the better model of the two regressions and explain your reasoning.
8. The popularity of a movie might depend on the type of movie. Add Genre Adventure to the
multiple regression in Q7 and interpret the results. Is there evidence that adventure movies
are more popular than non-adventure movies? Explain the findings. Another argument is that the
relationship between Total Cast and Popularity varies by movie type. To check this
hypothesis, add an interaction term between Genre Adventure and Total Cast to the multiple
regression. Export the table and interpret the results. [hint: check whether the interaction term is
significant and what its sign is]
9. Based on the same dataset, develop a hypothesis. Based on your hypothesis, select the dependent
variable, independent variable and control variable(s) and run a multiple regression. Then develop a
second hypothesis which addresses the potential heterogeneity that might exist in the first hypothesis
and run a second multiple regression.
Here is an example:
Hypothesis 1: Total Cast has a positive effect on Popularity.
Hypothesis 2: The effect of Total Cast on Popularity differs between Adventure and non-
Adventure movies.
Export the tables, interpret the results of both regressions and state whether they support your
hypotheses or not. You are free to use any variables of interest.
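The analysis steps in Questions 1 to 8 could be organised in a do-file roughly as sketched below. This is a minimal sketch and not a model answer. The variable names popularity, total_cast, total_crew and genre_adventure are assumptions (check the actual names with describe after importing), genre_adventure is assumed to be a 0/1 dummy, and the use of ln Popularity in the regressions is one possible choice that you still need to justify.

```stata
* Sketch only: all variable names below are assumed, not taken from the dataset.
clear all
import delimited "Movies_data.csv", clear
describe                                  // confirm the actual variable names

* Q1: histogram with normal curve and kernel density overlaid
histogram popularity, normal kdensity
generate ln_popularity = ln(popularity)   // missing where popularity <= 0
histogram ln_popularity, normal kdensity

* Q2: observations, mean, standard deviation, minimum, maximum
summarize popularity total_cast total_crew genre_adventure

* Q3: pairwise correlations with significance levels
pwcorr popularity total_cast total_crew genre_adventure, sig

* Q4: two-sample t-test (the by() variable must take exactly two values)
ttest popularity, by(genre_adventure)

* Q5: scatter plot with a fitted line
twoway (scatter ln_popularity total_cast) (lfit ln_popularity total_cast)

* Q6: simple regression
regress ln_popularity total_cast

* Q7: multiple regression, then variance inflation factors
regress ln_popularity total_cast total_crew
estat vif                                 // current form of the vif command

* Q8: add the genre dummy, then the interaction with Total Cast
regress ln_popularity total_cast total_crew i.genre_adventure
regress ln_popularity c.total_cast##i.genre_adventure total_crew
```

In the last regression, the ## operator includes both main effects and their interaction. When the dependent variable is ln Popularity, a coefficient b on Total Cast corresponds to roughly a 100·b percent change in Popularity per additional cast member, an approximation that holds for small b.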
Submission instructions:
Email your submission to Min Zou (m.zou@henley.ac.uk) and Irakli Barbakadze (i.barbakadze@pgr.reading.ac.uk) by
23:59 on Monday, 17 January 2022.
- A PDF that includes all tables, graphs and text. Note: you need to export your regression outputs using
the outreg2 command. Screenshots from Stata are not accepted.
- A Stata do-file (.do).
File names: MMD010_Assignment_2022_firstname_surname.pdf,
MMD010_Assignment_2022_firstname_surname.do.
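Exporting regression tables with outreg2 might look like the sketch below. It is illustrative only: outreg2 is a user-written command that has to be installed from SSC, the variable names are assumptions, and the output file name regression_tables.doc is hypothetical.

```stata
* Sketch only: variable names and the output file name are assumptions.
ssc install outreg2                       // run once; outreg2 is user-written

import delimited "Movies_data.csv", clear
generate ln_popularity = ln(popularity)   // missing where popularity <= 0

* First model: replace starts a new output file
regress ln_popularity total_cast
outreg2 using regression_tables.doc, replace ctitle(Simple)

* Second model: append adds a column to the same table
regress ln_popularity total_cast total_crew
outreg2 using regression_tables.doc, append ctitle(Multiple)
```

Putting both models side by side in one table makes it easier to compare coefficients and R-squared across specifications.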