Statistics tasks need done

101hypothesistesting.pdf

101 hypothesis testing
Part 1 11/12/2020
Part 2 11/13/2020
Part 3 11/15/2020

Homework assignment
Due date: 11/22/2020

Carry out a two tailed hypothesis test for Elog absX = 3.5 at the 95% confidence level using the ideal normal distribution, with the below data from June to September each year, as we show in the example below. Then, carry out a right tailed hypothesis test at the 97.5% level, using the ideal t-distribution, as in the later example below. Just follow my steps, or, if you wish, make up your own testing procedure, but preferably after studying my examples.

Part 1
what is a hypothesis?

When we come to study a system via experiment or some sort of sampling procedure, we do not come to the investigation in total absence of cognitive frameworks. The phenomena we investigate will connect up with aspects of our knowledge of the world or, on the theoretical side, of mathematics. In other words, we almost always come to a data collection process with preconceptions, prejudices, hunches, biases, and knowledge from prior work. Thus, we usually come to an experiment or sampling process with certain expectations of outcomes or tentative guesses about what we expect. These preliminary expectations are referred to as hypotheses. An experiment is sometimes designed partly around exploring the extent to which empirical data can support these preliminary expectations. When we try to assess the extent of support, this is called hypothesis testing. Thus, in a sense, hypothesis testing is a pretty common human pursuit, and something akin to mere curiosity about the world.

However, when we refer to hypothesis testing in statistics we mean explicitly the statistical framework first enunciated by Neyman and Pearson in 1933. They had a very formal mathematical approach, in which we see the experiment in the perspective of the following table:

Accuracy of hypothesis      Evidence supports      Evidence does not
(description of world)      hypothesis             support hypothesis
accurate                    correspondence         Type 1 error
not accurate                Type 2 error           correspondence

accurate and evidence supports the hypothesis

This is THE ideal: We propose a hypothesis or guess, the experiment seems to support it, and the hypothesis is of some broad interest with respect to your goals. This is, however, unusual for actual research. When a researcher poses a serious hypothesis of interest it USUALLY means it is VERY difficult to get at by current methods and resources, and may require a very large expenditure of resources, and access to skill and knowledge well beyond the researcher’s reach.
This is particularly the case even in well-funded academic research. Needed expertise, knowledge, skill and equipment are often ASTRONOMICALLY expensive. Thus, at least in academic research, when there are extremely serious hypotheses that MUST be investigated from the viewpoint of funding agencies and other sources of support, it is usual to select a location, often very near a well-regarded research university, to set up a special lab or institute focused on the research. This allows channeling of government resources, and gives a center for talent to be directed. As we work more online, in these times, it is somewhat less important, but even so, it is unusual to see faculty not associated with special labs and institutes directing research toward some serious hypothesis.

not accurate and evidence does not support the hypothesis

This would seem not to be desirable: Formulate a hypothesis that you think might not be accurate and then accumulate data indicating it has little support. Superficially, that seems inefficient. Strategically, however, this approach is preferred by many top academic researchers. At first, to outsiders in the academic research world, this type of “trivial pursuit” can seem bizarrely pointless. Let’s try to flesh this out a little.

First, when a research area is initially opened up, it usually entails new perspectives, resources, skills and knowledge outside most researchers’ specializations. This means the “hot” research is usually “out of reach”. Still, an academic researcher in a related field with expertise knows that if he or she can “connect”, it will probably lead to a publishable result via the bandwagon effect. Furthermore, to maintain their status and power position an academic researcher usually needs a fair number of relatively “high quality” publications: This state of affairs has always been known as “publish or perish”.

This is the merit of high specialization, expertise, and skill level at particular research specialty niches. The academic researcher develops an extreme familiarity with one or perhaps several very, very narrow fields of study, and becomes deeply enmeshed in and capable at their intricacies, their basic and subtle methods, and the classic approaches to research. This is done by the PhD apprenticeship, various internships, post-doctoral positions, various research positions, special workshop training, collaborative teams, and in teaching and basic scholarship. In this way, an academic researcher REALLY knows certain highly intricate and complex details and subtleties concerning certain research areas. When he or she achieves such levels of expertise, if it can be connected to the “hot” work going on by formulating a suitable hypothesis accessible to the researcher’s expertise, and the hypothesis is a generally accepted one, then if the researcher can adduce evidence against it, the result is almost certainly one, possibly more, publications.

There are a number of good points in favor of this strategy. First, most supported claims are usually associated with some hidden anthropocentric biases or prejudices. Therefore, it is about 80% probable that a typical accepted scientific claim has some incorrect aspects related to this bias. A second, perhaps critical point about selecting a well-favored hypothesis accessible to your specialty is that you are VERY likely to note something interesting or inconsistent, or see little ways to innovate.
In other words, in the “publish or perish” world of academia, these two characteristics alone are telling you that this kind of “straw man” hypothesis test is usually your meal ticket in academia. Of course, to maintain quality, you must be proficient at the very high level of standards in academia because it is hugely rule heavy, and writing a successful paper for publication is much like playing a good game of chess. An academic must REALLY know how to “dot the i’s” and “cross the t’s”. But training in a specialization usually gives proficiency at this: It is all self-motivated, and you have to basically want to play the game, because the academic publish or perish game is usually pretty sharp and brutal.

Inevitably, this “gaming the system” turns off a lot of outsiders to academic research. We can ask the question as to whether or not it is in the long run worthwhile and beneficial to the researcher’s culture or to some broader segments of humanity. Chasing grants and writing papers and guiding students and other researchers in your project involves a lot of time and effort. Is this a good pursuit, in terms of benefits to humanity? Certainly it maintains the researcher’s status and power. In addition, it helps to sustain the university or institute with grants and status and prestige. A research/academic setting of a college or university is a lot like an athletic team, and its performance must be sustained. Therefore, in these immediate ways all the research is valuable.

The fact that only perhaps 5 to 10 other specialists may read the papers means the results, if important, are not going to emerge in the public immediately and likely never. Training students and other researchers is of course a valuable exercise. So the apprenticeships of the PhD have social value. Plus, the fact that the researchers usually have to chase grant money means at least that their research must seem of some value to experts in related fields.
accurate and evidence does not support hypothesis: Type I errors

We have to see this in the context of typical academic research: Selection of a hypothesis that seems to be commonly accepted. We do this to get publications by finding contrary evidence, and in a way that is a pretty good “typical and useful task” to question common assumptions. Unfortunately, in this context it is AWFUL to get a Type I error because your paper is likely to be published and may have a massive effect of suppressing research in what was, until your negative results, an accepted approach. Therefore, we need to adjust our statistics studies to minimize the chance of a Type I error: In academic research this type can be tragic and devastating.

In constructing confidence intervals, we set the “probability” alpha of a Type I error in our examples at 5%. Why don’t we just set it at 0 if we need to minimize it??? That would mean the accepted hypothesis we are testing is not accessible to any negative support from our experiments or surveys. In other words: We don’t even have motivation to carry out the study and it gives no opportunity for a publication. Thus, we cannot afford to set alpha at the ideal value of 0, but it is not something that can have ANY natural value. So setting the value is purely an abstract subjective decision based on what you think will convince reviewers of your paper to allow its publication. The value NEEDS to be close to 0 but greater than zero, and at a practical level that lets you get successful negative results plus convince reviewers that your paper is acceptable for publication when they do the “automated proof checking” of the “academic chess game”. So it is basically a matter of what the “research public” charges for getting a publication: Usually people accept a 5% or so value for alpha. Of course, it depends on prestige level. If you want to get a Nobel prize in physics, you better set alpha around 0.000001% or so, i.e. 1/millionth of a percent, and the hypothesis being tested better be SUPER important.

setting alpha very close to zero

When we set alpha very close to zero, we start to make it more likely that we will make a Type II error: Finding support for an inaccurate hypothesis. In context of the typical academic philosophy and strategy, this is not too likely.
We usually choose a hypothesis that is commonly accepted, and do not try to publish unless lack of support is obtained from our data. Nevertheless, it is quite common for academics to try to avoid this error. The probability of making this error is called beta. The power of the hypothesis test is 1 - beta.

Why do academics do this??? This too, is strategic. If we get only one convincing negative data item not supporting the hypothesis, THAT is not too convincing. The idea about beta is to get researchers to find at least several data items not in support of the hypothesis. Now, the unusual data values are established via fitting the data to a distribution, like the normal distribution or the t distribution. When we get several clustered data items, we can set up an alternative hypothesis involving a data fit to these values. The idea is to think of the clustered extreme values not supporting the hypothesis as heavily weighted, and the values close to the center of the original hypothesis we were testing as in the tail of the distribution, and given a low weight. By doing this, we are giving people an indicator that the true value, the true center, lies somewhere in the vicinity of the center of the alternative hypothesis probability distribution fit and the clustered extreme values weighing against the original hypothesis. This usually gives added weight to a research paper and makes it more likely to get published, as it predicts what an accurate hypothesis test might look like or at least points in a good direction.

historical context

Hypothesis testing was introduced by Neyman and Pearson in a 1933 paper. As we indicate above, it has proven extremely useful for academic researchers to get publications from their research. Many people have objected to this strategy of turning research into a type of “chess game” and researchers “gaming the system” for status and power. This completes our discussion for now.
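
To make the trade-off between alpha and beta discussed above a little more concrete, here is a minimal sketch for a two-tailed test under the ideal normal distribution. The function name type_two_error and all of the numbers (null center 0, alternative center 3, standard error 1) are hypothetical and chosen only for illustration; they do not come from the notes.

```python
from statistics import NormalDist

def type_two_error(alpha, null_center, alt_center, std_error):
    """Probability (beta) that a two-tailed z-test keeps the null hypothesis
    when the sample mean is really centered at alt_center."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lowest = null_center - z_crit * std_error
    highest = null_center + z_crit * std_error
    # Distribution of the sample mean if the alternative hypothesis is accurate.
    alt = NormalDist(mu=alt_center, sigma=std_error)
    return alt.cdf(highest) - alt.cdf(lowest)

# Hypothetical numbers, for illustration only.
for alpha in (0.05, 0.01, 0.001):
    beta = type_two_error(alpha, null_center=0.0, alt_center=3.0, std_error=1.0)
    print(f"alpha={alpha}: beta={beta:.4f}, power={1 - beta:.4f}")
```

Under these assumptions, pushing alpha toward zero widens the acceptable range, so beta grows and the power 1 - beta shrinks.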

Part 2
confidence intervals vs hypothesis testing

Because both confidence intervals and hypothesis testing focus on alpha, the probability of a Type I error, associated with how the critical z-values are selected, the “method” of statistics is essentially identical. It is the “philosophy” that is different: “confidence intervals” are fine for many statistical studies of a traditional nature, but it is usually questioning “a commonly held belief” and “finding statistical support challenging it” that is the key to cheap research and getting your research published at a satisfactory level of quality for the academic fields utilizing statistics. Therefore, the difference in emphasis between confidence intervals and hypothesis testing is totally that of pragmatism: Hypothesis testing is pretty much the bread and butter of academics involved in the publish or perish environment of academic research. So methodologically we do not expect much difference between a hypothesis test and constructing a confidence interval.

p-value

It is common to state the p-value for a hypothesis test. Using the probability distribution we have fit the data to for the hypothesis test, we compute, for the random variable X associated with our study, the probability p of a deviation from the center value of the hypothesis we are testing that is at least as extreme as the one observed. Strategically, this is often regarded by academics trying to get their research published as more relevant than a critical z-value. The strategic reason for academics’ preference for the p-value is that it avoids claims about alpha. In fact it is not even necessary to calculate a critical z-value if you use p-values, but to enhance publication chances it is often wise, as alpha is “traditional” and “recognized”. The merit of the p-value is strategic: Especially in a world where we now often have access to enormous amounts of data.
With such a very large population to draw from, it is usually very feasible, by using software to search the data, to find a large collection (but small compared to the total data population) that is significantly not in support of the commonly accepted hypothesis. The p-value for the resulting probability distribution fit will typically be MUCH less than 5%. This usually gives sharp researchers with access to a lot of data enormous advantages in the publish or perish academic game. The problem with this type of gamesmanship, of course, is that it greatly enhances chances of losing faith in the conventional hypothesis which might actually be accurate. As we pointed out before, a single researcher playing this kind of game will not kill convention, but if there is a “bandwagon” effect, a beneficial and well-supported result might lose research directed in its favor, with a concomitant loss of benefit.

hypothesis test example

Let’s test the claim that crude oil production is slowing in California at an exponential rate. This means that log(abs((production at time t + T) - (production at time t))) is roughly constant for a particular time T (relatively short) and t = variable time. Here, abs is the absolute value: For example, abs(5) = 5 and abs(-5) = 5.

data (on California crude oil production: see Sec 8.5 for details)

strategy

We are just going to go through a very simple example of a test today. We will use
X[t] = (production in January + February + March + April + May from year t) - (like total from prior year)
We take a sum to get a larger value to subtract because subtractions cost in loss of information and we amplify and smooth over noise when we add several months together.

table (showing lists on TI calculator and instructions to create the lists)

L1 = production, Y
L2 = X = Y - Y(prior year): seq(L1(X) - L1(X - 1),X,2,11) sto L2
L3 = log abs(X): log(abs(L2)) sto L3

year    L1        L2        L3
1981    149010
1982    153536      4526    3.6557
1983    153809       273    2.4362
1984    157064      3255    3.5126
1985    161217      4153    3.6184
1986    163369      2152    3.3328
1987    150882    -12487    4.0965
1988    149511     -1371    3.1370
1989    138432    -11079    4.0445
1990    133365     -5067    3.7048
1991    132686      -679    2.8319

comment

It does appear that the logarithm list is roughly constant, between about 2.4 and 4.1. This is unexpected because production peaks then decreases so the differences X vary from positive to negative values.

hypothesis test step 1: find the point estimate Elog(abs(X)) using empirical frequencies

Elog(abs(X)) = sum[X] log(abs(X)) (f/N)
N = number of data items = 10 here
f = frequency for the particular value X (all frequencies = 1 here)
Elog(abs(X)) = (1/10) sum(L3) here (using the TI calculator) = 3.4370

hypothesis test step 2: specify alpha and type of test (two tailed; left tailed; right tailed)

Since the null hypothesis is that the population of X targets a constant, we want a two-tailed test, i.e. with alpha/2 as the probability for each tail. For the hypothesis test, alpha is called the level of significance. For our test we take alpha = 0.05 or 5% and alpha/2 = 0.025. This corresponds to the 1 - alpha, 95% confidence interval, but here we are making an actual test.

hypothesis test step 3: specify the ideal data fit probability distribution for the population

We assume the population is fit by the ideal normal distribution. Since the distribution is symmetric about Z = 0, we only need the lower critical value, which we find using Solver on the TI:
MATH > Solver > normalcdf(-1000, X) - 0.025 > ALPHA ENTER
We set the guess at -1.5.
Result: critical z-value = 1.9600 (take the negative of the calculator result)
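
For readers without a TI calculator, steps 1 to 3 can be sketched in Python. This is only an illustration, not the procedure from the notes: the variable names are made up here, the differences are recomputed from the production column rather than copied from the table, and statistics.NormalDist stands in for the calculator's Solver on normalcdf.

```python
import math
from statistics import NormalDist

# List L1: production column Y from the table, 1981 to 1991.
production = [149010, 153536, 153809, 157064, 161217, 163369,
              150882, 149511, 138432, 133365, 132686]

# List L2: year-over-year differences X = Y - Y(prior year).
diffs = [b - a for a, b in zip(production, production[1:])]

# List L3: base-10 logarithm of abs(X), matching the TI "log" key.
log_abs = [math.log10(abs(x)) for x in diffs]

# Step 1: point estimate, with every frequency f equal to 1.
point_estimate = sum(log_abs) / len(log_abs)

# Steps 2 and 3: two-tailed test at alpha = 0.05 under the ideal normal
# distribution; the upper critical value is the negative of the lower one.
alpha = 0.05
z_crit = -NormalDist().inv_cdf(alpha / 2)

print(diffs)
print([round(v, 4) for v in log_abs])
print(round(point_estimate, 4), round(z_crit, 4))
```

Recomputing the lists from L1 is a cheap way to check each entry of L2 and L3 against the calculator.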

hypothesis test step 4: estimate the standard deviation of the list of expectation values (assuming the population is normally distributed)

We use the sample empirical probability estimate of the variance for the list of log abs(X):
Var log abs(X) = (1/(N - 1)) sum[X] (log abs(X) - Elog abs(X))^2 f (with f = 1 here)
Var log abs(X) = (1/9) sum((L3 - 3.4370)^2) (in this case) = 0.2701
Estimate of the variance of the list of Elog abs(X):
VarElog abs(X) = Var log abs(X)/N = 0.02701
(Using this for the hypothesis test assumes the validity of the central limit theorem.)
The standard deviation of the list of Elog abs(X) is the square root of this variance:
SDElog abs(X) = sqrt(VarElog abs(X)) = 0.1643

hypothesis test step 5: specify a test value for the test for Elog abs(X)

We have a point estimate: Elog abs(X) = 3.4370, and the values range from about 2.4 to 4.1. Let’s select as the test value Elog absXTest = 3.5, which seems to be suggested by the data median.

hypothesis test step 6: specify the lowest acceptable value for Elog abs(X) and the highest (we have both a lowest and a highest for a two tailed test)

lowest acceptable value = Elog absXTest - SDElog absX x (critical z-value) = 3.5 - 0.1643 x 1.9600 = 3.1780
highest acceptable value = Elog absXTest + SDElog absX x (critical z-value) = 3.5 + 0.1643 x 1.9600 = 3.8220

conclusion of the hypothesis test

As the sample Elog absX = 3.4370 lies between the lowest and highest acceptable values, we conclude that the data here does not satisfy the conditions for finding negative support for the hypothesis. As finding negative support is the goal of successful hypothesis testing, this hypothesis test fails at the 95% confidence level, i.e. we cannot reject the claim of the hypothesis. Next time we will give a more complete treatment of the hypothesis test for the data here.
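
As a cross-check on the hand calculation, the whole two-tailed test can be run end to end in a few lines. This is a sketch under stated assumptions, not the notes' own procedure: it takes the ten L3 values as printed in the table, uses the sample variance with divisor N - 1, and assumes the spread of the sample mean is measured by the standard error sqrt(variance/N). The two-sided p-value at the end follows the description in the p-value section of Part 2.

```python
import math
from statistics import NormalDist, mean, variance

# List L3 from the table: log10 of abs(year-over-year production change).
log_abs = [3.6557, 2.4362, 3.5126, 3.6184, 3.3328,
           4.0965, 3.1370, 4.0445, 3.7048, 2.8319]

n = len(log_abs)
point_estimate = mean(log_abs)           # step 1
sample_var = variance(log_abs)           # step 4, divides by N - 1
std_error = math.sqrt(sample_var / n)    # assumed spread of the sample mean

alpha = 0.05                             # step 2, two-tailed
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # step 3

test_value = 3.5                         # step 5
lowest = test_value - z_crit * std_error       # step 6
highest = test_value + z_crit * std_error

# Two-sided p-value for the observed deviation from the test value.
z_obs = (point_estimate - test_value) / std_error
p_value = 2 * NormalDist().cdf(-abs(z_obs))

reject = not (lowest <= point_estimate <= highest)
print(f"mean={point_estimate:.4f} var={sample_var:.4f} se={std_error:.4f}")
print(f"acceptable range=({lowest:.4f}, {highest:.4f}) p={p_value:.4f}")
print("reject" if reject else "cannot reject")
```

Any figure that differs from the hand calculation is worth rechecking on the calculator before it is relied on.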

Part 3
comment about hypothesis testing

We have given a very simple worked example of a hypothesis test. We selected a hypothesis that seems well accepted about declining resources, so it totally looked like we would be able to succeed at adducing evidence against it, even in the very limited data we used from California. Still the simple example shows the methodology. Of course, the sole interest, from a professional perspective of an academic researcher, is to get a publication to maintain or further his or her career. So we failed in that regard since getting a paper accepted in an academic journal usually implies, minimally, that you found evidence against commonly accepted notions.

In a way, it is not tragic that our example failed. Usually, in actual academic research, the first proposals do not work out for one reason or another. The researchers either revise their project or drop it as unproductive. A lot often depends on the funding agency, and that means the review committees supporting your work. Usually this is somewhat preferable to leaving it up to the individual researcher. A committee of experts can sometimes see a bit more clearly than an individual, especially in practical terms and terms of utilization of scarce resources and expertise and time. The downside to allowing a committee to decide is that as specialists, their assessment of your work is like chess experts evaluating the play of a master: They can pretty much tell you whether you are playing a good game or not. The trouble with that kind of specialized “game” orientation is that maybe you aren’t playing chess at all, and they are missing the importance of what you are working on.

what next?

It is important to present some other perspectives with respect to hypothesis testing. Our first test led to failure plus a bit of a weird unexpected result. That used the normal distribution.
As we know from prior studies, the normal distribution is usually too narrow (like the binomial distribution) to fit data well, and this is related to the usual suspects. Obviously then, as we have done in the past, we want to go on to the t-distribution, a broader, more flexible distribution, with df = degrees of freedom as an additional parameter. This leads to some tolerance for picking up on other types of processes than those related to the peak of the distribution, which may be introducing outliers. In addition, the data collecting/analysis process has inherent tendencies toward anthropocentric bias, and the investigators are almost certain to be making some errors when large amounts of data are collected. This entails an overall noise throughout the distribution, more apparent in the tails, so a broader tailed distribution can be more effective in separating signal from noise.

progressing beyond the beginning stage

The t-distribution is obviously an excellent next step, as just introducing one extra parameter, namely df, is not likely to lead to overfitting (i.e. fitting to noise over signal) which is always a danger when introducing additional parameters. Plus, as mentioned above, it has obvious advantages over the normal distribution, including the fact that the ideal t-distribution is practically as easy to find critical z-values for as the ideal normal distribution.

Obviously matters start to get much more subtle as we move beyond the simple normal distribution/central limit theorem perspective of Neyman. This was one of the main hold-ups in Neyman’s rather idealized take on statistics. There were numerous harsh practical details that needed to be worked out over the years, costing a lot in terms of effort and resources, and accompanied with substantial criticism from elite researchers who could see the limitations of Neyman’s introductory work, and were troubled by the paths being pursued to turn it into a practical tool. They did not want to rock the boat. The method works today pretty well for the established academic researchers.

t-distribution

Because the method is so similar to the construction of confidence intervals, we are going to present just this one more example before leaving the topic of hypothesis testing. We will change from the two-tailed test of our first normal distribution example to a one-tailed test (right-tailed or left-tailed), to give a little insight into a potentially useful trick. Overall, if you want to make it in academia in the publish or perish world right now, you really need to be savvy about hypothesis testing, but for us at the beginning level, most of whom do not intend to become elite researchers in academia, delving deeper into this rather specialized and tricky “academic chess game” is not of much interest.

table (showing lists on TI calculator and instructions to create the lists)

L1 = production, Y
L2 = X = Y - Y(prior year): seq(L1(X) - L1(X - 1),X,2,11) sto L2
L3 = log abs(X): log(abs(L2)) sto L3

year    L1        L2        L3
1981    149010
1982    153536      4526    3.6557
1983    153809       273    2.4362
1984    157064      3255    3.5126
1985    161217      4153    3.6184
1986    163369      2152    3.3328
1987    150882    -12487    4.0965
1988    149511     -1371    3.1370
1989    138432    -11079    4.0445
1990    133365     -5067    3.7048
1991    132686      -679    2.8319

We need to group the data in a class/frequency table. We will start at 2.1 to 2.5 (class boundaries 2.05 to 2.55):

class/frequency table

class           frequency, f
2.05 – 2.55     1
2.55 – 3.05     1
3.05 – 3.55     3
3.55 – 4.05     4
4.05 – 4.55     1

This table reveals clearly that the data is not near a symmetric distribution. This does not mean that the hypothesis test will be unreliable. We are simply using just a few data values. Still the skew to lower values is pronounced. Therefore, a right tailed hypothesis test should work okay in this case.
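
The class/frequency grouping can be reproduced with a short sketch. The class boundaries used here (2.05, 2.55, ..., 4.55, width 0.5) are an assumption chosen so that every one of the ten L3 values falls in a class; the variable names are illustrative only.

```python
# List L3 from the table: log10 of abs(year-over-year production change).
log_abs = [3.6557, 2.4362, 3.5126, 3.6184, 3.3328,
           4.0965, 3.1370, 4.0445, 3.7048, 2.8319]

# Assumed class boundaries: 2.05, 2.55, ..., 4.55 (five classes of width 0.5).
start, width, n_classes = 2.05, 0.5, 5

counts = [0] * n_classes
for value in log_abs:
    # Index of the class whose lower boundary is at or below the value.
    index = int((value - start) // width)
    counts[index] += 1

for i, f in enumerate(counts):
    low = start + i * width
    print(f"{low:.2f} - {low + width:.2f}   {f}")
```

Checking that the frequencies add up to N = 10 is a quick way to catch a value that has fallen outside the chosen classes.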
hypothesis test step 1: find the point estimate Elog(abs(X)) using empirical frequencies

(This step does not depend on the specific hypothesis, so we have the same result as before.)
Elog(abs(X)) = sum[X] log(abs(X)) (f/N)
N = number of data items = 10 here
f = frequency for the particular value X (all frequencies = 1 here)
Elog(abs(X)) = (1/10) sum(L3) here (using the TI calculator) = 3.4370

hypothesis test step 2: specify alpha and type of test (two tailed; left tailed; right tailed)

It looks like a right-tailed test is likely to succeed in this case, as the actual distribution has a long left tail. So we will test, with 97.5% confidence (alpha = 0.025), that the value of the population Elog abs(X) is larger than 3.5 (before we tested for “equality”).

hypothesis test step 3: specify the ideal data fit probability distribution for the population

We assume the population is fit by the ideal t-distribution, setting df = N - 1 = 9 (here). Since the distribution is symmetric about Z = 0, we only need the lower critical value, which we find using Solver on the TI:
MATH > Solver > tcdf(-1000, X, df) - 0.025 > ALPHA ENTER
We set the guess at -1.9600 (the old ideal normal distribution lower critical value).
Result: critical z-value = 2.2622 (take the negative of the calculator result: Since we are making a right tailed test we need the upper critical value.)

hypothesis test step 4: estimate the standard deviation of the list of expectation values (assuming the population is normally distributed)

(This step is computed the same way as for the ideal normal distribution, as the standard deviation, like the expectation value, does not depend on the hypothesis test, nor does it depend on the critical z-value.)
We use the sample empirical probability estimate of the variance for the list of log abs(X):
Var log abs(X) = (1/(N - 1)) sum[X] (log abs(X) - Elog abs(X))^2 f (with f = 1 here)
Var log abs(X) = (1/9) sum((L3 - 3.4370)^2) (in this case) = 0.2701
Estimate of the variance of the list of Elog abs(X):
VarElog abs(X) = Var log abs(X)/N = 0.02701
(Using this for the hypothesis test assumes the validity of the central limit theorem.)
The standard deviation of the list of Elog abs(X) is the square root of this variance:
SDElog abs(X) = sqrt(VarElog abs(X)) = 0.1643

hypothesis test step 5: specify a test value for the test for Elog absX

We have a point estimate: Elog absX = 3.4370, and the values range from about 2.4 to 4.1. We continue to use Elog absXTest = 3.5, which seems to be suggested by the data median. This has nothing to do with selecting a right tailed, left tailed or two tailed test.

hypothesis test step 6: specify the lowest acceptable value for Elog absX (we have only a lowest acceptable value for a right tailed test. Since we have examined the data with a class/frequency table we have pretty much “loaded the dice” in our favor in this case.)

lowest acceptable value = Elog absXTest + SDElog absX x (critical z-value) = 3.5 + 0.1643 x 2.2622 = 3.8717

conclusion of the hypothesis test

As the sample Elog absX = 3.4370 lies below the lowest acceptable value, we conclude that the data here does satisfy the conditions for finding negative support for the hypothesis. As this is the goal of successful hypothesis testing, this hypothesis test succeeds at the 97.5% confidence level, i.e. we have evidence in favor of rejecting the claim of the hypothesis. This concludes our discussion of Chapter 10.
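
The Solver step for the t-distribution can be imitated without a calculator. The sketch below is one possible stand-in, not the notes' method: it integrates the t density with Simpson's rule and finds the lower critical value by bisection on tcdf(x, df) - 0.025 = 0. The function names are hypothetical, the standard error is assumed to be sqrt(sample variance/N), and the threshold simply applies the step 6 rule as stated above.

```python
import math
from statistics import mean, variance

# List L3 from the table: log10 of abs(year-over-year production change).
log_abs = [3.6557, 2.4362, 3.5126, 3.6184, 3.3328,
           4.0965, 3.1370, 4.0445, 3.7048, 2.8319]

def t_pdf(x, df):
    """Density of the ideal t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using symmetry about 0 and Simpson's rule on [0, abs(x)]."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def lower_critical(tail_prob, df):
    """Solve t_cdf(x, df) - tail_prob = 0 for x < 0 by bisection."""
    lo, hi = -50.0, 0.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = len(log_abs) - 1
t_crit = -lower_critical(0.025, df)          # upper critical value by symmetry
std_error = math.sqrt(variance(log_abs) / len(log_abs))  # assumed spread of the mean
threshold = 3.5 + t_crit * std_error         # step 6 rule as stated in the text
point_estimate = mean(log_abs)

print(round(t_crit, 4), round(threshold, 4), round(point_estimate, 4))
```

Bisection is slower than the calculator's Solver but needs no starting guess beyond a bracket that contains the root, here the interval from -50 to 0.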
