Section B and C only
FACULTY OF SCIENCE AND ENGINEERING
SCHOOL OF COMPUTING, MATHEMATICS & DIGITAL MEDIA
REASSESSMENT COURSEWORK 2013/14
UNIT CODE: 6G6Z3005
UNIT DESC: APPLIED REGRESSION AND MULTIVARIATE ANALYSIS
ASSESSMENT ID: 1CWK30
ASSESSMENT NAME: Coursework 30%
WEIGHT FACTOR: 30%
See below. NAME OF STAFF SETTING ASSIGNMENT: Dr B L Shea
MANCHESTER METROPOLITAN UNIVERSITY
FACULTY OF SCIENCE AND ENGINEERING
SCHOOL OF COMPUTING, MATHEMATICS & DIGITAL TECHNOLOGY
ACADEMIC YEAR 2013-2014: REFERRED COURSEWORK
BSC(HONS) FINANCIAL MATHEMATICS
BSC(HONS) MATHEMATICS
YEAR/STAGE THREE
UNIT 6G6Z3005: APPLIED REGRESSION AND MULTIVARIATE ANALYSIS
Answer ALL questions. The pass mark is 40%, which corresponds to a minimum of 72 marks out of a possible 180 marks. The deadline is 8th August 2014.
SECTION A
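1. (a) Three measurements $x_1$, $x_2$ and $x_3$ have the following sample covariance matrix.
\[ \hat{\Sigma} = \begin{pmatrix} 9 & 2 & 0 \\ 2 & 4 & 1 \\ 0 & 1 & 4 \end{pmatrix} \]
(i) Verify that the corresponding sample correlation matrix $C$ is given by
\[ C = \begin{pmatrix} 1 & \tfrac{1}{3} & 0 \\ \tfrac{1}{3} & 1 & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & 1 \end{pmatrix} \]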
[2]
(ii) Given that one of the eigenvalues of C is equal to one, calculate the other two eigenvalues and determine the proportion of the variation in the data explained by the first principal component.
[6]
(iii) Using the sample correlation matrix C, calculate the first principal component.
[6]
(b) A Principal Components Analysis of the prices of food items in 23 cities was carried out with a view to forming a measure of the Consumer Price Index (CPI). A Minitab analysis of this data is attached.
(i) Explain why Principal Components Analysis was performed on the correlation matrix instead of the covariance matrix.
[2]
(ii) If the first Principal Component is taken as a measure of the CPI, calculate, to one decimal place, the value of the index for Atlanta.
[2]
(iii) Which is the most expensive city and which is the least expensive city?
[2]
Minitab output for Question 1
Descriptive Statistics: bread, burger, milk, oranges, tomatoes
Variable   N    Mean     Median   TrMean   StDev   SE Mean
bread      23   25.291   25.300   25.267    2.507  0.523
burger     23   91.86    91.00    91.63     7.55   1.58
milk       23   62.30    62.50    61.96     6.95   1.45
oranges    23   102.99   105.90   102.90   14.24   2.97
tomatoes   23   48.77    46.80    48.74     7.60   1.59
Principal Component Analysis: bread, burger, milk, oranges, tomatoes
Eigenanalysis of the Correlation Matrix
Eigenvalue   2.4225   1.1047   0.7385   0.4936   0.2408
Proportion   0.484    0.221    0.148    0.099    0.048
Cumulative   0.484    0.705    0.853    0.952    1.000

Variable     PC1      PC2      PC3      PC4      PC5
bread        0.496   -0.309    0.386   -0.509   -0.500
burger       0.576   -0.044    0.262    0.028    0.773
milk         0.340   -0.431   -0.835   -0.049    0.008
oranges      0.225    0.797   -0.292   -0.479   -0.006
tomatoes     0.506    0.287    0.012    0.713   -0.391
Data Display
Row  city            bread  burger  milk  oranges  tomatoes  y1
1    ATLANTA         24.5   94.5    73.9  80.1     41.6
2    BALTIMORE       26.5   91.0    67.5  74.6     53.3      132.265
3    BOSTON          29.7   100.8   61.4  104.0    59.6      147.226
4    BUFFALO         22.8   86.6    65.3  118.4    51.2      135.940
5    CHICAGO         26.7   86.7    62.7  105.9    51.2      134.235
6    CINCINNATI      25.3   102.5   63.3  99.3     45.6      138.527
7    CLEVELAND       22.8   88.8    52.4  110.9    46.8      128.907
8    DALLAS          23.3   85.5    62.5  117.9    41.8      129.733
9    DETROIT         24.1   93.7    51.5  109.7    52.4      134.632
10   HONOLULU        29.3   105.9   80.2  133.2    61.7      163.989
11   HOUSTON         22.3   83.6    67.8  108.6    42.4      128.156
12   KANSAS CITY     26.1   88.9    65.4  100.9    43.2      130.950
13   LOS ANGELES     26.9   89.3    56.2  82.7     38.4      121.925
14   MILWAUKEE       20.3   89.6    53.8  111.8    53.9      132.399
15   MINNEAPOLIS     24.6   92.2    51.9  106.0    50.7      132.459
16   NEW YORK        30.8   110.7   66.0  107.3    62.6      157.298
17   PHILADELPHIA    24.5   92.3    66.7  98.0     61.7      141.265
18   PITTSBURGH      26.2   95.4    60.2  117.1    49.3      139.707
19   ST LOUIS        26.5   92.4    60.8  115.1    46.2      136.313
20   SAN DIEGO       25.5   83.7    57.0  92.8     35.4      119.032
21   SAN FRANCISCO   26.3   87.1    58.3  101.8    41.5      126.940
22   SEATTLE         22.5   77.7    62.0  91.1     44.9      120.212
23   WASHINGTON DC   24.2   93.8    66.0  81.6     46.2      130.209
2. The daily expenditures on food $x_1$ and clothing $x_2$ of five people are shown in the table below.

Person   Food $x_1$   Clothing $x_2$
a        2            4
b        8            2
c        9            3
d        1            5
e        8.5          1

(a) Calculate the squared Euclidean distance matrix for this data.
[6]
(b) Use the distance matrix in (a) to construct a dendrogram using the Complete Linkage method and draw it on graph paper.
[6]
(c) For these five people calculate and plot root-mean-square standard deviation (RMSSTD) against number of clusters on a scree graph.
[5]
(d) Explain, with reasons, how many clusters your dendrogram and scree graph show. Identify those people in each cluster.
[3]
3. Annual financial data for firms were collected two years ago, and divided into two groups on the basis of the firm’s current financial standing: bankrupt and non-bankrupt. The data on two variables,
\[ x_1 = \frac{\text{current assets}}{\text{total liability}} \quad \text{and} \quad x_2 = \frac{\text{current assets}}{\text{net sales}}, \]
is given in the table below for a sample of three non-bankrupt firms.

Non-bankrupt firms
$x_1$   $x_2$
2.49    0.54
2.01    0.53
3.27    0.35

(a) Calculate the sample covariance matrix for non-bankrupt firms.
[4]
(b) If the covariance matrix for bankrupt firms based on a sample of size 4 is
\[ \begin{pmatrix} 0.36 & -0.12 \\ -0.12 & 0.05 \end{pmatrix} \]
calculate the pooled covariance matrix.
[2]
(c) Use Box’s M-test to determine whether or not it is reasonable to assume equal covariance matrices in each group.
[7]
(d) Assuming equal covariance matrices calculate the Linear Discriminant Function for the group of non-bankrupt firms.
[4]
(e) Suppose that the Linear Discriminant Function for bankrupt firms is $26.54x_1 + 78.22x_2 - 63.18$. Predict whether or not a firm with a current assets to total liability ratio of 3.0 and a current assets to net sales ratio of 0.05 is likely to go bankrupt.
[3]
SECTION B
4. In a double blind clinical trial run by a pharmaceutical company, patients are randomly assigned to receive one of two drugs. Each trial is called a success if the patient responds positively and a failure otherwise. Drug A is given to 35 patients, of whom 14 respond positively; whereas, for patients given drug B, 27 out of 32 respond positively. You may assume that, for each drug, the number responding positively, Y out of n, follows a Binomial distribution with mass function,
\[ f(y; \pi) = {}^{n}C_{y}\, \pi^{y} (1-\pi)^{n-y}, \quad y = 0, 1, \ldots, n \]
(a) Suppose that the probability of a positive response is the same for both drugs, i.e. $\pi_A = \pi_B = \pi$, say. Write down the likelihood function for $\pi$ and hence show that the maximum likelihood estimate of $\pi$ is $\hat{\pi} = 41/67$.
[6]
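The estimate quoted in part (a) can be checked numerically. The following is a minimal Python sketch, assuming the pooled counts of 41 positive responses (14 + 27) and 26 failures (21 + 5) out of 67 patients; the function names `log_likelihood` and `score` are illustrative and not part of the paper.

```python
import math

def log_likelihood(pi):
    """Log-likelihood for a common success probability, constant omitted.

    Assumes 41 successes and 26 failures pooled over both drugs.
    """
    return 41 * math.log(pi) + 26 * math.log(1 - pi)

def score(pi):
    """Derivative of the log-likelihood with respect to pi."""
    return 41 / pi - 26 / (1 - pi)

pi_hat = 41 / 67  # value stated in part (a)

# The score should vanish at the maximum likelihood estimate.
print(abs(score(pi_hat)) < 1e-9)
# Nearby values should give a smaller log-likelihood.
print(log_likelihood(pi_hat) > log_likelihood(0.60))
print(log_likelihood(pi_hat) > log_likelihood(0.62))
```

This is only a numerical check; the question still asks for the analytical derivation.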
(b) In order to provide a statistical test of the difference, if any, in the effectiveness of the drugs, a model is proposed in which the probabilities of success for each drug are given by $\pi_A = \pi$, $\pi_B = \delta\pi$.
(i) Show that, apart from a constant, the log-likelihood function for $(\pi,\delta)$ is given by
\[ l(\pi,\delta) = 41\ln(\pi) + 21\ln(1-\pi) + 27\ln(\delta) + 5\ln(1-\delta\pi) \]
(ii) Hence find the maximum likelihood estimates of $\pi$ and $\delta$.
[8]
(c) (i) State a simple hypothesis about the value of $\delta$ that would indicate equality in the effectiveness of the two drugs.
(ii) Carry out a likelihood ratio test of this hypothesis and report your conclusions.
[6]
5. (a) A statistician proposes to model the lifetime of patients in a two-arm clinical trial using a Weibull(2) distribution. Assume that the hazard and survivor functions are given by
\[ h_1(t_1) = 2\lambda t_1, \quad h_2(t_2) = 2(\lambda\phi) t_2, \quad S_1(t_1) = e^{-\lambda t_1^2}, \quad S_2(t_2) = e^{-(\lambda\phi) t_2^2} \]
Assume also that $n_1$ and $n_2$ patients are allocated to arms 1 and 2, respectively, and that some survival times may be right censored.
(i) Show that, apart from a constant, the log-likelihood function for $(\lambda,\phi)$ can be written as
\[ l(\lambda,\phi) = n_{u1}\ln(\lambda) - \lambda\sum_{i=1}^{n_1} t_{1i}^2 + n_{u2}\ln(\lambda\phi) - (\lambda\phi)\sum_{i=1}^{n_2} t_{2i}^2 \]
where $n_{ui}$ indicates the number of uncensored observations in group $i$.
(ii) Hence, show that the maximum likelihood estimators of $(\lambda,\phi)$ are given by
\[ \hat{\lambda} = \frac{n_{u1}}{\sum_{i=1}^{n_1} t_{1i}^2}, \qquad \hat{\phi} = \frac{n_{u2}\sum_{i=1}^{n_1} t_{1i}^2}{n_{u1}\sum_{i=1}^{n_2} t_{2i}^2} \]
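(iii) Show that the observed information matrix for $(\lambda,\phi)$ is given by
\[ I_O(\lambda,\phi) = \begin{pmatrix} \dfrac{n_{u1}+n_{u2}}{\lambda^2} & \sum_{i=1}^{n_2} t_{2i}^2 \\ \sum_{i=1}^{n_2} t_{2i}^2 & \dfrac{n_{u2}}{\phi^2} \end{pmatrix} \]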
[10]
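The estimators and information matrix stated in part (a) translate directly into code. The following is a minimal Python sketch under the Weibull(2) model above; the function names and the input values are hypothetical and chosen only to make the arithmetic easy to follow, they are not the data of part (b).

```python
import math

def weibull2_mle(nu1, sum_sq1, nu2, sum_sq2):
    """MLEs (lambda_hat, phi_hat) from uncensored counts and sums of t^2."""
    lam = nu1 / sum_sq1
    phi = (nu2 * sum_sq1) / (nu1 * sum_sq2)
    return lam, phi

def observed_information(lam, phi, nu1, nu2, sum_sq2):
    """Observed information matrix for (lambda, phi) as a 2x2 list."""
    return [[(nu1 + nu2) / lam ** 2, sum_sq2],
            [sum_sq2, nu2 / phi ** 2]]

def phi_standard_error(info):
    """Square root of the (2,2) entry of the inverse information matrix."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    return math.sqrt(info[0][0] / det)

# Hypothetical inputs: 2 and 3 uncensored times, sums of squares 4 and 3.
lam_hat, phi_hat = weibull2_mle(2, 4.0, 3, 3.0)
info = observed_information(lam_hat, phi_hat, 2, 3, 3.0)
se_phi = phi_standard_error(info)
# Approximate 95% interval: phi_hat +/- 1.96 * se_phi
interval = (phi_hat - 1.96 * se_phi, phi_hat + 1.96 * se_phi)
print(lam_hat, phi_hat, info, interval)
```

The interval uses the usual large-sample normal approximation, which is an assumption of this sketch rather than something the paper prescribes.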
(b) The following data shows the survival times in years of cancer patients, classified into two groups; treatment 1 and treatment 2 (Peto, R. et al., British Journal of Cancer, 1977). An asterisk denotes a right censored observation.

Treatment 1               Treatment 2
0.02   0.02   0.14        0.04   0.05   0.06
0.17   0.17   0.60        0.19   0.21   0.49
1.00*  2.33*  3.55*       0.53   0.58   1.73
3.64*  4.00*  5.41*       1.92   3.55   5.45*
                          6.14*
$\sum t_i^2 = 77.9873$    $\sum_{i=1}^{n} t_i^2 = 87.6292$
Assuming that the patients’ lifetimes follow a Weibull(2) distribution,
(i) Find the maximum likelihood estimates of $(\lambda,\phi)$ for this data.
(ii) State a simple hypothesis about $\phi$ that would indicate equality in the hazard functions of the two groups.
(iii) Find the inverse of the information matrix and, hence, find an approximate 95% confidence interval for $\phi$. Comment briefly on your result.
[10]
6. (a) Briefly explain the derivation of the partial likelihood for Cox’s proportional hazards model,
\[ L(\beta) = \prod_{j=1}^{n} \frac{\exp(x_j^T\beta)}{\sum_{k \in R(t_j)} \exp(x_k^T\beta)} \]
in which an individual’s hazard is modelled as
\[ h(t; x) = h_0(t)\exp(x^T\beta) \]
where $h_0(t)$ is an unspecified baseline hazard function.
[5]
(b) A study was conducted into the lifetimes (months) of two types of water pumps with results as follows:
Type I    5   8*   17   27*
Type II   3   6*   9    11*
where * indicates that the time was right censored, i.e. the pump was still operating at the end of the study.
(i) Letting the indicator variable
\[ x = \begin{cases} 0 & \text{denote Type I} \\ 1 & \text{denote Type II} \end{cases} \]
show that the partial likelihood in Cox’s proportional hazards model for this data is (apart from a constant of proportionality)
\[ L(\beta) = \frac{e^{2\beta}}{(1+e^{\beta})^2} \times \frac{1}{4+3e^{\beta}} \]
[5]
(ii) Show that the partial likelihood for $\beta$ is maximised when $\hat{\beta} = 0.792$.
[3]
(iii) Show that the observed information for $\beta$ is given by
\[ I_O(\beta) = \frac{2e^{\beta}}{(1+e^{\beta})^2} + \frac{12e^{\beta}}{(4+3e^{\beta})^2}. \]
[4]
(iv) Conduct a test of the hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ and report your conclusions.
[3]
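The value quoted in part (ii) can be checked numerically. The following is a minimal Python sketch, assuming the partial likelihood stated in part (i); the function names and the bisection bracket [0, 2] are illustrative choices, not part of the paper.

```python
import math

def score(beta):
    """Derivative of ln L(beta) = 2*beta - 2*ln(1 + e^beta) - ln(4 + 3*e^beta)."""
    u = math.exp(beta)
    return 2 - 2 * u / (1 + u) - 3 * u / (4 + 3 * u)

def observed_information(beta):
    """Minus the second derivative of ln L(beta), as stated in part (iii)."""
    u = math.exp(beta)
    return 2 * u / (1 + u) ** 2 + 12 * u / (4 + 3 * u) ** 2

def solve_score(lo=0.0, hi=2.0, tol=1e-10):
    """Bisection on the score, which is strictly decreasing in beta."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta_hat = solve_score()
print(round(beta_hat, 3))  # agrees with the 0.792 quoted in part (ii)
print(observed_information(beta_hat))
```

Bisection is used here because the score changes sign exactly once on the bracket; a Newton-Raphson iteration using `observed_information` would serve equally well.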
SECTION C
7. (a) A random variable $Y$ follows a Binomial $B(n,\pi)$ distribution with probability mass function
\[ f(y) = \binom{n}{y}\pi^{y}(1-\pi)^{n-y}, \quad y = 0, 1, \ldots, n; \quad 0 < \pi < 1 \]
(i) Show that this distribution forms an exponential family by writing the mass function in the form
\[ \ln(f(y)) = \frac{y\theta - b(\theta)}{\phi} + c(y,\phi) \]
[4]
(ii) Show further that the deviance function for the Binomial is given by
\[ D = 2\sum_{i}\left[ y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i - \hat{\mu}_i}\right) \right] \]
[4]
(b) Thirty patients were given an anesthetic agent maintained at a predetermined level for 15 minutes before making an incision. It was then noted whether the patient moved, i.e. jerked or twisted. The results are shown in Table 1 with variables as follows,
• conc: anesthetic concentration
• logconc: logarithm of concentration
• total: total number of patients given level of anesthetic
• nomove: the number of patients that show no movement

conc   logconc    total   nomove
0.8    -0.22314   7       1
1.00    0.00000   5       1
1.20    0.01823   6       4
1.40    0.33647   6       4
1.60    0.47000   4       4
2.50    0.91629   2       2
Table 1: Results from anesthetic trial
Using the output provided in Figure 1 answer the following questions.
(i) Calculate the mean and standard deviation of the underlying tolerance distribution.
[4]
(ii) The underlying tolerance distribution is assumed to be logistic. Suggest an alternative distribution that could be used.
[2]
(iii) How do the odds of jerking or twisting vary with increasing concentration of anesthetic agent?
[2]
(iv) Calculate ED99.
[4]
Figure 1: Output for Question 7
8. Data were collected from MBA students measuring whether they were happy or not. Variables are as follows:
• happiness: not happy =0 or happy =1
• sex: satisfactory sexual activity=1 or not =0
• love: lonely=1, secure relationships=2, deep feeling of belonging and caring=3
• work: 5 point scale where 1=no job, 3=OK job, 5= great job
(a) A generalised linear model with logit link was run and the following table shows part of the SPSS output obtained. Complete the table (cells to be completed are shown as ____).

XB Predicted   XB Std. Err.   lower odds   odds    upper odds   predicted probability
 1.551         0.161          ____         ____    ____         0.825
 1.010         0.349          1.385        2.746   5.441        0.733
-0.829         ____           0.256        ____    0.744        0.304
 ____          ____           4.718        8.281   14.534       ____
-0.320         0.109          ____         0.726   0.899        0.421
 0.242         0.102          ____         1.274   ____         ____
Table 2: Table for happiness model
[8]
(b) An experiment was conducted to test the effectiveness of an insecticide for killing budworms. The results are shown in Table 3, with variables as follows,
• sex: sex of the budworm (male = 1, female = 2)
• dose: dose of the insecticide trans-cypermethrin in grammes
• ln dose: log of dose
• ndead: number of budworms killed from 20 exposed per trial

sex      dose   ln dose   ndead
male     1      0.00      1
male     2      0.69      4
male     4      1.39      9
male     8      2.08      13
male     16     2.77      18
male     32     3.47      20
female   1      0.00      0
female   2      0.69      2
female   4      1.39      6
female   8      2.08      10
female   16     2.77      12
female   32     3.47      16
Table 3: Insecticide used on budworms
Using the output provided answer the following questions:
(i) Is there any evidence of an interaction effect?
[2]
(ii) Calculate LD50 and LD90 rates for both male and female budworms.
[8]
(iii) What is the relative potency of the insecticide compared between the two sexes?
[2]
Figure 2: Output for Question 8
9. (a) Consider a log-linear model for an $(a \times b)$ contingency table indexed by two factors; $A$ with $a$ levels and $B$ with $b$ levels.
(i) Write down the form of the log-linear main effects model. State the number of parameters in this model and, hence, the residual degrees of freedom.
[2]
(ii) Write down the form of the log-linear model including an interaction term between $A$ and $B$. State the number of parameters in this model, the residual degrees of freedom and the deviance.
[2]
(iii) Explain how to carry out a test of independence between $A$ and $B$ and give the degrees of freedom for this test.
[2]
(b) A follow up study was conducted on the incidence of coronary heart disease amongst all male workers at a car factory. The following factors, A,B,C,D,E and F were recorded:
1. smoke [A]: either Yes or No
2. mental [B]: does strenuous mental work, either Yes or No
3. phys [C]: strenuous physical work, either Yes or No
4. systol [D]: systolic blood pressure, either high 140+ or < 140
5. protein [E]: ratio of lipoproteins, < 3 or 3+
6. family [F]: reported history of heart disease, either Yes or No
A graphical model was developed for the data.
(i) Describe the graphical modelling procedure.
[5]
(ii) What conclusions can you draw from the first stage of edge deletions given in Table 4?
[4]
(iii) The final model (after all possible edge deletions have been made) is drawn in Figure 3. Write the model in [XYZ] form.
[2]
(iv) Which of the following statements are True/False:
1. The model is decomposable.
2. $P(D,E|A) = P(D|A)P(E|A)$
3. $P(A,F|C) = P(A|C)P(F|C)$
[3]
Model   deviance   residual df   p-value
[AB]d   22.65      16            0.12341
[AC]d   42.80      16            0.00030
[AD]d   28.72      16            0.02589
[AE]d   40.02      16            0.00077
[AF]d   21.31      16            0.16690
[BC]d   684.99     16            0.00000
[BD]d   12.23      16            0.72800
[BE]d   17.23      16            0.37087
[BF]d   22.79      16            0.11947
[CD]d   14.81      16            0.53860
[CE]d   18.63      16            0.28832
[CF]d   22.15      16            0.13841
[DE]d   18.35      16            0.01322
[EF]d   18.32      16            0.303804
Table 4: 1-edge deleted graphical models
Nodes: [A] smokes, [B] mental, [C] phys, [D] systol, [E] protein, [F] family
Figure 3: Final graphical model
6G6Z3005 Additional Formulae
Principal Components Analysis
\[ \mathrm{Corr}(x,y) = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x) \times \mathrm{Var}(y)}} \]
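The correlation formula can be applied entry by entry to a covariance matrix. A minimal Python sketch follows, using the covariance matrix quoted in Question 1(a) as input; the function name `covariance_to_correlation` is an illustrative choice.

```python
import math

def covariance_to_correlation(cov):
    """Divide each covariance by the product of the two standard deviations."""
    size = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
            for i in range(size)]

# Sample covariance matrix from Question 1(a)
sigma_hat = [[9, 2, 0],
             [2, 4, 1],
             [0, 1, 4]]

corr = covariance_to_correlation(sigma_hat)
for row in corr:
    print([round(value, 4) for value in row])
```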
Cluster Analysis
\[ \mathrm{RMSSTD} = \sqrt{\frac{\sum_{j=1}^{p} S_j^2}{p}} \]
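A minimal Python sketch of the RMSSTD formula for a single cluster follows, assuming that $S_j^2$ is the sample variance of variable $j$ among the cluster members. The function name and the small data set are hypothetical and are not taken from Question 2.

```python
import math
from statistics import variance

def rmsstd(columns):
    """Root-mean-square standard deviation of one cluster.

    `columns` holds one list of values per variable; each list needs at
    least two members for the sample variance to be defined.
    """
    variances = [variance(column) for column in columns]
    return math.sqrt(sum(variances) / len(variances))

# Hypothetical cluster of three members measured on p = 2 variables
food = [1, 2, 3]        # sample variance 1
clothing = [2, 4, 6]    # sample variance 4
value = rmsstd([food, clothing])
print(value)
```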
Discriminant Analysis
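\[ \hat{\Sigma} = \frac{1}{(n-1)} \begin{pmatrix} \sum x_1^2 - \dfrac{(\sum x_1)^2}{n} & \sum x_1 x_2 - \dfrac{\sum x_1 \sum x_2}{n} \\ \sum x_1 x_2 - \dfrac{\sum x_1 \sum x_2}{n} & \sum x_2^2 - \dfrac{(\sum x_2)^2}{n} \end{pmatrix} \]
\[ S_p = \frac{(n_A - 1)S_A + (n_B - 1)S_B}{n_A + n_B - 2} \]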
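The two formulae above can be sketched in Python as follows. The function names and the input data are hypothetical, chosen so that the sums are easy to verify by hand; they are not the firms of Question 3.

```python
def sample_covariance(x1, x2):
    """2x2 sample covariance matrix from the sums-of-squares formula."""
    n = len(x1)
    sum1, sum2 = sum(x1), sum(x2)
    s11 = sum(a * a for a in x1) - sum1 ** 2 / n
    s22 = sum(b * b for b in x2) - sum2 ** 2 / n
    s12 = sum(a * b for a, b in zip(x1, x2)) - sum1 * sum2 / n
    return [[s11 / (n - 1), s12 / (n - 1)],
            [s12 / (n - 1), s22 / (n - 1)]]

def pooled_covariance(s_a, n_a, s_b, n_b):
    """Weighted average of two covariance matrices with weights n - 1."""
    return [[((n_a - 1) * s_a[i][j] + (n_b - 1) * s_b[i][j]) / (n_a + n_b - 2)
             for j in range(2)] for i in range(2)]

# Hypothetical sample of three cases
s = sample_covariance([1, 2, 3], [2, 4, 7])
# Pooling a matrix with itself should return the same matrix
pooled = pooled_covariance(s, 3, s, 4)
print(s)
print(pooled)
```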
Box’s M test
To test $H_0: \Sigma_1 = \Sigma_2$ v $H_1: \Sigma_1 \neq \Sigma_2$, the test statistic is $U = \gamma M$, where
\[ \gamma = 1 - \left( \frac{2p^2 + 3p - 1}{6(p+1)(k-1)} \right) \left( \left( \sum_{i=1}^{k} \frac{1}{(n_i - 1)} \right) - \frac{1}{N-k} \right) \]
and
\[ M = (N-k)\ln|S_p| - \sum_{i=1}^{k} (n_i - 1)\ln|S_i| \]
$N = \sum_{i=1}^{k} n_i$ (total number of cases in the sample), $p$ = number of $x$-variables and $k$ = number of groups.
Under $H_0$, $U$ has a chi-square distribution with $\frac{p(p+1)(k-1)}{2}$ degrees of freedom.
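A minimal Python sketch of the Box's M calculation for $p = 2$ variables follows. The function names, the covariance matrices and the sample sizes are hypothetical; the sketch only evaluates $U$ and its degrees of freedom and leaves the comparison with chi-square tables to the reader.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as a list of lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def box_m(cov_list, n_list):
    """Return (U, degrees of freedom, gamma) for k groups and p = 2."""
    p = 2
    k = len(cov_list)
    big_n = sum(n_list)
    # Pooled covariance matrix S_p with weights n_i - 1
    pooled = [[sum((n - 1) * s[i][j] for s, n in zip(cov_list, n_list)) / (big_n - k)
               for j in range(p)] for i in range(p)]
    m_stat = (big_n - k) * math.log(det2(pooled)) - sum(
        (n - 1) * math.log(det2(s)) for s, n in zip(cov_list, n_list))
    gamma = 1 - ((2 * p * p + 3 * p - 1) / (6 * (p + 1) * (k - 1))) * (
        sum(1 / (n - 1) for n in n_list) - 1 / (big_n - k))
    df = p * (p + 1) * (k - 1) // 2
    return gamma * m_stat, df, gamma

# Hypothetical covariance matrices and sample sizes
s1 = [[2.0, 0.5], [0.5, 1.0]]
s2 = [[1.0, 0.2], [0.2, 3.0]]
u, df, gamma = box_m([s1, s2], [10, 12])
u_same, _, _ = box_m([s1, s1], [10, 12])
print(u, df, gamma)
print(u_same)  # identical covariance matrices give U close to zero
```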
Linear Discriminant Function
\[ \mathrm{LDF}(A) = \mu_A^T \Sigma^{-1} x - \frac{1}{2}\mu_A^T \Sigma^{-1} \mu_A \]
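A minimal Python sketch of the Linear Discriminant Function for two variables follows, returning the coefficients of $x_1$ and $x_2$ and the constant term. The function names, the mean vector and the covariance matrix are hypothetical.

```python
def linear_discriminant(mean, cov):
    """Coefficients mu^T Sigma^{-1} and constant -0.5 * mu^T Sigma^{-1} mu (p = 2)."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    inverse = [[d / det, -b / det], [-c / det, a / det]]
    coef = [mean[0] * inverse[0][j] + mean[1] * inverse[1][j] for j in range(2)]
    constant = -0.5 * (coef[0] * mean[0] + coef[1] * mean[1])
    return coef, constant

def ldf_value(coef, constant, x):
    """Evaluate the discriminant function at the point x = (x1, x2)."""
    return coef[0] * x[0] + coef[1] * x[1] + constant

# Hypothetical group mean and pooled covariance matrix
coef, constant = linear_discriminant([1.0, 2.0], [[2.0, 0.0], [0.0, 4.0]])
print(coef, constant)
print(ldf_value(coef, constant, [3.0, 1.0]))
```

A case is usually allocated to the group whose discriminant function takes the larger value at its $x$, which is how two such functions would be compared.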
The log-rank test
Suppose that there are $r$ distinct ordered failure times $t_i$ in the two groups of interest. Summarise the data in a series of $r$ two-way tables as follows,

Group   Number of failures $d_i$   Number surviving beyond $t_i$   Number at risk, $R(t_i)$
I       $d_{1i}$                   $R_1(t_i) - d_{1i}$             $R_1(t_i)$
II      $d_{2i}$                   $R_2(t_i) - d_{2i}$             $R_2(t_i)$
Total   $d_i$                      $R_i - d_i$                     $R_i$

The expected number of failures in the first group is,
\[ e_{1i} = \frac{R_1(t_i) \times d_i}{R_i} \]
The log-rank test is,
\[ R = \sum_{i=1}^{r} d_{1i} - \sum_{i=1}^{r} e_{1i} \]
with variance,
\[ V_R = \mathrm{var}(R) = \sum_{i=1}^{r} \frac{R_{1i} R_{2i}\, d_i (R_i - d_i)}{R_i^2 (R_i - 1)} = \sum_{i=1}^{r} v_{1i}, \text{ say.} \]
Under $H_0$, the statistic
\[ \frac{R^2}{V_R} \sim \chi^2_1 \]
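The log-rank formulae above can be sketched in Python as follows. The function name and the 1/0 event coding are illustrative choices; the example input is the water pump data of Question 6(b), used here only to exercise the formulae. Tables with a single item at risk are skipped in the variance sum, an assumption made to avoid division by zero.

```python
def log_rank(times1, events1, times2, events2):
    """Return (R, V_R, R^2 / V_R); events are 1 = failure, 0 = right censored."""
    failure_times = sorted(
        {t for t, e in zip(times1, events1) if e} |
        {t for t, e in zip(times2, events2) if e})
    observed = expected = variance = 0.0
    for t in failure_times:
        r1 = sum(1 for u in times1 if u >= t)   # at risk in group I
        r2 = sum(1 for u in times2 if u >= t)   # at risk in group II
        d1 = sum(1 for u, e in zip(times1, events1) if e and u == t)
        d2 = sum(1 for u, e in zip(times2, events2) if e and u == t)
        r, d = r1 + r2, d1 + d2
        observed += d1
        expected += r1 * d / r
        if r > 1:
            variance += r1 * r2 * d * (r - d) / (r * r * (r - 1))
    stat = observed - expected
    return stat, variance, stat * stat / variance

# Water pump lifetimes from Question 6(b); 0 marks a censored time
type1_times, type1_events = [5, 8, 17, 27], [1, 0, 1, 0]
type2_times, type2_events = [3, 6, 9, 11], [1, 0, 1, 0]
stat, var_r, chi_sq = log_rank(type1_times, type1_events, type2_times, type2_events)
print(stat, var_r, chi_sq)
```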
Exponential family
A random variable $Y$ with probability density (mass) function $f(y)$ is said to form an exponential family if we can write,
\[ \ln(f(y)) = \frac{y\theta - b(\theta)}{\phi} + c(y,\phi) \]
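As an illustration that is not part of the paper, the Poisson distribution with mean $\mu$ (an assumed example) can be written in this form:

```latex
\[
\ln(f(y)) = \ln\!\left(\frac{\mu^{y} e^{-\mu}}{y!}\right)
          = y\ln(\mu) - \mu - \ln(y!)
\]
\[
\theta = \ln(\mu), \qquad b(\theta) = e^{\theta}, \qquad \phi = 1, \qquad c(y,\phi) = -\ln(y!)
\]
```

Matching terms in the same way identifies $\theta$, $b(\theta)$, $\phi$ and $c(y,\phi)$ for other members of the family.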