Statistics project & Quiz 6


Statistics Project

Temperatures in January, 2006 in Purcellville, Virginia

Submitted by Suzanne Sands

Purpose: Analyze temperatures for January, 2006 in my local region, Purcellville, Virginia. Most people are interested in the local weather, including me! I focused on the weather in January, 2006. News

reports indicated a much warmer January than usual. I was interested in compiling descriptive summaries in the

form of charts and numerical measures to get a sense of the typical temperature for January, 2006, and how the

temperatures have varied over the course of the month. (This particular project example is an adaptation of similar

project examples I have used in statistics classes I have taught in the past.)

Data: Random Sample of 30 Temperatures in January, 2006 in Purcellville, Virginia

Data Collection: An excellent website, www.weatherunderground.com, provides temperature readings from thousands of weather stations. Toward the middle of the screen, I typed “Purcellville” in the “Location” box and

arrived at the Purcellville forecast. At the bottom of that page, there are links for personal weather stations. I

clicked on the “Top of Tranquility, Purcellville, VA” link and arrived at

http://www.wunderground.com/weatherstation/WXDailyHistory.asp?ID=KVAPURCE1

You can search for weather readings any day you like in the recent past. This particular weather station

recorded 12 temperatures every hour back in 2006, so there were 12 readings/hr x 24 hours x 31 days = 8,928

temperature readings for January, 2006! I decided to select a simple random sample of 30 temperatures from

this large collection of data.

I collected 30 temperatures at random times in January, 2006. (Random sampling is NOT a requirement for your project. For instance, you could record the high temperature for each day.) FYI: Here is how I chose the random sample: Since there are 31 days in January, I generated 30 random numbers between 1 and 31 (with possible repetition). (You will see below that many days are repeated.) Next I determined

the sampling times. Since there were 288 temperature readings each day, I generated 30 random numbers between 1

and 288, representing the reading numbers. Since there were 12 readings per hour, I divided the reading random

number by 12 to get the hour and used the remainder to figure out which reading to choose during that hour. I

looked up the temperatures for each randomly selected day and time, and recorded the appropriate temperature.
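The selection procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `draw_sample_slots` and the seed are hypothetical, the random numbers actually used for the project are not reproduced, and subtracting 1 before dividing by 12 is a convention chosen here so that the hour always falls between 0 and 23.

```python
import random

READINGS_PER_HOUR = 12
READINGS_PER_DAY = READINGS_PER_HOUR * 24               # 288 readings per day
DAYS_IN_JANUARY = 31
TOTAL_READINGS = READINGS_PER_DAY * DAYS_IN_JANUARY     # 8,928 readings in the month


def draw_sample_slots(n=30, seed=None):
    """Return n (day, hour, reading_in_hour) triples; repeated days are allowed."""
    rng = random.Random(seed)
    slots = []
    for _ in range(n):
        day = rng.randint(1, DAYS_IN_JANUARY)        # random day, 1..31 inclusive
        reading = rng.randint(1, READINGS_PER_DAY)   # random reading number, 1..288
        # Quotient gives the hour; remainder gives the reading within that hour.
        # (Assumption: shift by 1 so reading 288 maps to hour 23, not hour 24.)
        hour, reading_in_hour = divmod(reading - 1, READINGS_PER_HOUR)
        slots.append((day, hour, reading_in_hour))
    return sorted(slots)


slots = draw_sample_slots(seed=2006)  # hypothetical seed, for repeatability only
for day, hour, k in slots[:3]:
    print(f"January {day}, hour {hour}, reading {k + 1} of {READINGS_PER_HOUR}")
```

Each triple identifies one reading to look up on the weather station's history page; the temperatures themselves still have to be recorded by hand.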

Date = day of January, 2006; Time is on a 24-hour clock; Temperature is in degrees.

Count  Date   Time  Temperature    Count  Date   Time  Temperature
    1     1   2:41         44.1       16    13  11:55         48.4
    2     2   6:35         31.3       17    13  16:01         57.9
    3     2  16:21         39.7       18    15  17:35         34.9
    4     2  18:11         40.6       19    16   4:55         32.4
    5     3  11:45         39.7       20    16  11:11         37.6
    6     4   8:05         35.2       21    16  18:35         36.3
    7     4  11:55         42.4       22    18   9:01         44.6
    8     4  20:35         39.9       23    18  13:45         40.5
    9     4  23:52         40.3       24    23   5:41         34.2
   10     5   5:21         39.2       25    25  20:31         32.2
   11     9  11:31         52.5       26    25  21:31         31.9
   12    10   2:45         43.7       27    27   2:11         27.7
   13    10   4:21         43.0       28    28  14:35         61.7
   14    12   2:31         37.9       29    28  19:21         48.7
   15    12  16:55         54.9       30    31   1:01         48.7


Temperature Data, in ascending order:

27.7 31.3 31.9 32.2 32.4 34.2 34.9 35.2 36.3 37.6 37.9 39.2 39.7 39.7 39.9

40.3 40.5 40.6 42.4 43.0 43.7 44.1 44.6 48.4 48.7 48.7 52.5 54.9 57.9 61.7

Notes: To construct a frequency distribution, typically we need to group the data into about four to eight intervals. In

looking over the sorted data, ranging from 27.7 to 61.7, it seems reasonable to use intervals of width 5 or 10 degrees.

Frequency Distribution: The two tables below group the data in intervals of 10 degrees and in intervals of 5 degrees, respectively.

REMARKS: Both tables show that the temperatures are principally clustered in the 30’s and 40’s. Which

table is better? It’s really a toss-up; either one is fine. It’s not necessary to make more than one table. I am

showing two tables, just for illustration purposes.

If a table has very low frequencies for all of the intervals (say a frequency of 1-2 for each interval), or if

there are more than 10 intervals, that would be an indication that the interval width is too small. For

example, if each interval consisted of just one degree, then the frequency table for this temperature data

would have over 30 rows and that table would not be very informative, in terms of helping to see where

the data are clustered.

30 Random Temperatures in January, 2006, Purcellville, VA (grouped in intervals of 10 degrees)

Temperature (degrees)   Frequency   Relative Frequency
19.95 - 29.95                   1                 .033
29.95 - 39.95                  14                 .467
39.95 - 49.95                  11                 .367
49.95 - 59.95                   3                 .100
59.95 - 69.95                   1                 .033
Total                          30                1.000

30 Random Temperatures in January, 2006, Purcellville, VA (grouped in intervals of 5 degrees)

Temperature (degrees)   Frequency   Relative Frequency
24.95 - 29.95                   1                 .033
29.95 - 34.95                   6                 .200
34.95 - 39.95                   8                 .267
39.95 - 44.95                   8                 .267
44.95 - 49.95                   3                 .100
49.95 - 54.95                   2                 .067
54.95 - 59.95                   1                 .033
59.95 - 64.95                   1                 .033
Total                          30                1.000
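As a cross-check on the two frequency tables, the counts can be recomputed with a short Python sketch. The helper name `frequency_table` is hypothetical; the class boundaries (starting at 19.95 with width 10, and at 24.95 with width 5) are taken from the tables. Because the boundaries end in .95 and the data have one decimal place, no temperature can land exactly on a boundary.

```python
temps = [27.7, 31.3, 31.9, 32.2, 32.4, 34.2, 34.9, 35.2, 36.3, 37.6,
         37.9, 39.2, 39.7, 39.7, 39.9, 40.3, 40.5, 40.6, 42.4, 43.0,
         43.7, 44.1, 44.6, 48.4, 48.7, 48.7, 52.5, 54.9, 57.9, 61.7]


def frequency_table(data, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency) for each class."""
    rows = []
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        rows.append((round(lower, 2), round(upper, 2), freq, freq / len(data)))
    return rows


table10 = frequency_table(temps, 19.95, 10, 5)   # intervals of 10 degrees
table5 = frequency_table(temps, 24.95, 5, 8)     # intervals of 5 degrees

for lower, upper, freq, rel in table5:
    print(f"{lower:.2f} - {upper:.2f}  {freq:2d}  {rel:.3f}")
```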


Histogram

The histogram is a visual representation of the frequency distribution above, with the

temperatures grouped in intervals of 5 degrees.

The majority of temperatures (16 of 30, about 53%) fall between 34.95 and 44.95 degrees.

The histogram was generated with spreadsheet software. Your histogram does not have to be fancy. It can be hand-drawn or typed in plain text form. It is important that the scales and the labeling are clear and accurate.

Plain text histogram: Temperatures in January, 2006 in Purcellville, Virginia

Frequency
    |
9---|
    |                        8       8
8---|                    |XXXXXXX|XXXXXXX|
    |                    |XXXXXXX|XXXXXXX|
7---|                    |XXXXXXX|XXXXXXX|
    |                6   |XXXXXXX|XXXXXXX|
6---|            |XXXXXXX|XXXXXXX|XXXXXXX|
    |            |XXXXXXX|XXXXXXX|XXXXXXX|
5---|            |XXXXXXX|XXXXXXX|XXXXXXX|
    |            |XXXXXXX|XXXXXXX|XXXXXXX|
4---|            |XXXXXXX|XXXXXXX|XXXXXXX|
    |            |XXXXXXX|XXXXXXX|XXXXXXX|   3
3---|            |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|
    |            |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|   2
2---|            |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|
    |        1   |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|   1       1
1---|    |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|
    |    |XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|XXXXXXX|
0---|----|-------|-------|-------|-------|-------|-------|-------|-------|
       24.95   29.95   34.95   39.95   44.95   49.95   54.95   59.95   64.95
                    Temperatures (Degrees Fahrenheit)

(NOTE: If typing in plain text, use a fixed width font, such as Courier New)
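For anyone who prefers to generate a plain text histogram rather than type it by hand, here is a minimal Python sketch. It is not how the histogram in this project was produced; the function name `text_histogram` is hypothetical, and the bars are drawn horizontally (one row per interval) because that layout is simpler to print than the vertical layout.

```python
# Class intervals and frequencies from the 5-degree frequency distribution.
classes = [
    ("24.95 - 29.95", 1),
    ("29.95 - 34.95", 6),
    ("34.95 - 39.95", 8),
    ("39.95 - 44.95", 8),
    ("44.95 - 49.95", 3),
    ("49.95 - 54.95", 2),
    ("54.95 - 59.95", 1),
    ("59.95 - 64.95", 1),
]


def text_histogram(rows, mark="X"):
    """Build one line per class: label, a bar of marks, then the frequency."""
    lines = []
    for label, freq in rows:
        lines.append(f"{label} | {mark * freq} {freq}")
    return "\n".join(lines)


print("Temperature (degrees Fahrenheit) | Frequency")
print(text_histogram(classes))
```

As with a typed histogram, the output lines up properly only in a fixed width font.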

[Figure: spreadsheet histogram titled "30 Random Temperatures in January, 2006, Purcellville, VA". Horizontal axis: Temperature (degrees Fahrenheit), in 5-degree intervals from 24.95 to 64.95. Vertical axis: Frequency, from 0 to 9. Bar heights, left to right: 1, 6, 8, 8, 3, 2, 1, 1.]


MEDIAN: When the 30 data values are sorted, since 30 is even, the median is the average of the observations in the

middle, the average of the values in positions 15 and 16 in the sorted list.

27.7 31.3 31.9 32.2 32.4 34.2 34.9 35.2 36.3 37.6 37.9 39.2 39.7 39.7 39.9

40.3 40.5 40.6 42.4 43.0 43.7 44.1 44.6 48.4 48.7 48.7 52.5 54.9 57.9 61.7

Median = (39.9 + 40.3)/2 = 40.1 degrees.

SAMPLE MEAN = x̄ = 1242.1/30 = 41.40 degrees = the sum of the temperatures, divided by the sample size

Note that the mean is larger than the median. The histogram has a longer right "tail" compared to the left

end, due to a few relatively high temperatures. The mean is affected by the size of the highest

temperatures, but the median is not, so the mean is larger than the median.
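The median and mean calculations can be double-checked with a few lines of Python. This is a sketch for verification only; the variable names are illustrative, and the data are the 30 sorted temperatures listed above.

```python
sorted_temps = [27.7, 31.3, 31.9, 32.2, 32.4, 34.2, 34.9, 35.2, 36.3, 37.6,
                37.9, 39.2, 39.7, 39.7, 39.9, 40.3, 40.5, 40.6, 42.4, 43.0,
                43.7, 44.1, 44.6, 48.4, 48.7, 48.7, 52.5, 54.9, 57.9, 61.7]

n = len(sorted_temps)   # 30, an even sample size

# Positions 15 and 16 in the sorted list are indexes 14 and 15 in Python.
median = (sorted_temps[n // 2 - 1] + sorted_temps[n // 2]) / 2

# Sample mean: sum of the temperatures divided by the sample size.
mean = sum(sorted_temps) / n

print(f"median = {median:.1f} degrees")
print(f"mean   = {mean:.2f} degrees")
print("mean exceeds median:", mean > median)
```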

RANGE = 61.7 - 27.7 = 34.0 degrees = the difference between the maximum and minimum

SAMPLE VARIANCE = 66.1417 (calculations shown in the table below; I used a spreadsheet and pasted it into the document)

SAMPLE STANDARD DEVIATION = s = 8.13 degrees (calculation shown in the table below)

Data within one standard deviation of the mean must fall in the interval

(x̄ − s, x̄ + s) = (41.40 − 8.13, 41.40 + 8.13) = (33.27, 49.53)

Data within two standard deviations of the mean must fall in the interval

(x̄ − 2s, x̄ + 2s) = (41.40 − 2(8.13), 41.40 + 2(8.13)) = (25.14, 57.66)

Data within three standard deviations of the mean must fall in the interval

(x̄ − 3s, x̄ + 3s) = (41.40 − 3(8.13), 41.40 + 3(8.13)) = (17.01, 65.79)

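27.7 31.3 31.9 32.2 32.4 [34.2 34.9 35.2 36.3 37.6 37.9 39.2 39.7 39.7 39.9

40.3 40.5 40.6 42.4 43.0 43.7 44.1 44.6 48.4 48.7 48.7] 52.5 54.9 | 57.9 61.7

(The square brackets enclose the values within one standard deviation of the mean; the two values to the right of the bar lie more than two standard deviations from the mean.)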

In the interval (33.27, 49.53), there are 21 temperatures, and 21/30 = 70.0%

In the interval (25.14, 57.66), there are 28 temperatures, and 28/30 = 93.3%

In the interval (17.01, 65.79), there are 30 temperatures, and 30/30 = 100.0%

So, 70.0% of the temperatures fall within one standard deviation of the mean, 93.3% of the temperatures

fall within two standard deviations of the mean, and 100% of the temperatures fall within three standard

deviations of the mean. For a bell-shaped distribution, the respective percentages are approximately 68%,

95%, and 99.7%. For the temperature data, the percentages are reasonably close to the bell-shaped model,

so yes, the data distribution is approximately bell-shaped.
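The three interval counts can be verified with a short Python sketch. The helper name `count_within` is hypothetical. Note one difference from the hand calculation: the code uses the unrounded mean and standard deviation rather than 41.40 and 8.13, which does not change any of the counts for this data set.

```python
import statistics

temps = [27.7, 31.3, 31.9, 32.2, 32.4, 34.2, 34.9, 35.2, 36.3, 37.6,
         37.9, 39.2, 39.7, 39.7, 39.9, 40.3, 40.5, 40.6, 42.4, 43.0,
         43.7, 44.1, 44.6, 48.4, 48.7, 48.7, 52.5, 54.9, 57.9, 61.7]

mean = statistics.mean(temps)
s = statistics.stdev(temps)   # sample standard deviation (divides by n - 1)


def count_within(data, center, spread, k):
    """Count the values strictly inside (center - k*spread, center + k*spread)."""
    lower, upper = center - k * spread, center + k * spread
    return sum(1 for x in data if lower < x < upper)


counts = {k: count_within(temps, mean, s, k) for k in (1, 2, 3)}
for k, count in counts.items():
    print(f"within {k} standard deviation(s): {count} of {len(temps)} "
          f"= {100 * count / len(temps):.1f}%")
```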


Calculation of sample variance and sample standard deviation:

Col 1          Col 2       Col 3   Col 4 = [Col 3]^2
Count  Temperature x    x - Mean        (x - Mean)^2
    1           44.1      2.6967              7.2720
    2           31.3    -10.1033            102.0773
    3           39.7     -1.7033              2.9013
    4           40.6     -0.8033              0.6453
    5           39.7     -1.7033              2.9013
    6           35.2     -6.2033             38.4813
    7           42.4      0.9967              0.9933
    8           39.9     -1.5033              2.2600
    9           40.3     -1.1033              1.2173
   10           39.2     -2.2033              4.8547
   11           52.5     11.0967            123.1360
   12           43.7      2.2967              5.2747
   13           43.0      1.5967              2.5493
   14           37.9     -3.5033             12.2733
   15           54.9     13.4967            182.1600
   16           48.4      6.9967             48.9533
   17           57.9     16.4967            272.1400
   18           34.9     -6.5033             42.2933
   19           32.4     -9.0033             81.0600
   20           37.6     -3.8033             14.4653
   21           36.3     -5.1033             26.0440
   22           44.6      3.1967             10.2187
   23           40.5     -0.9033              0.8160
   24           34.2     -7.2033             51.8880
   25           32.2     -9.2033             84.7013
   26           31.9     -9.5033             90.3133
   27           27.7    -13.7033            187.7813
   28           61.7     20.2967            411.9547
   29           48.7      7.2967             53.2413
   30           48.7      7.2967             53.2413
  Sum         1242.1                       1918.1097

Mean = Col 2 sum / 30 = 1242.1 / 30 = 41.40333333

Sample Variance = Col 4 sum / 29 = 1918.1097 / 29 = 66.14171264 (29 is one less than the sample size)

Sample Standard Deviation = sqrt(Sample Variance) = 8.132755538

Note: The results of the calculations can be checked by using the spreadsheet functions var( ) and

stdev( ) in Excel. However, for the purposes of demonstrating understanding of the calculations,

you must show work similar to the table above.
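For readers working outside Excel, the same check can be sketched in Python. The code below rebuilds Col 3 and Col 4 of the table step by step and then compares the result with `statistics.variance` and `statistics.stdev` from the Python standard library, which, like the table, divide by one less than the sample size. The variable names are illustrative.

```python
import math
import statistics

# Temperatures in the order they were collected (Count 1 through 30).
temps = [44.1, 31.3, 39.7, 40.6, 39.7, 35.2, 42.4, 39.9, 40.3, 39.2,
         52.5, 43.7, 43.0, 37.9, 54.9, 48.4, 57.9, 34.9, 32.4, 37.6,
         36.3, 44.6, 40.5, 34.2, 32.2, 31.9, 27.7, 61.7, 48.7, 48.7]

n = len(temps)
mean = sum(temps) / n                          # Col 2 sum divided by 30

deviations = [x - mean for x in temps]         # Col 3: x - Mean
squared = [d ** 2 for d in deviations]         # Col 4: (x - Mean)^2

sample_variance = sum(squared) / (n - 1)       # Col 4 sum divided by 29
sample_sd = math.sqrt(sample_variance)

print(f"sum of squared deviations = {sum(squared):.4f}")
print(f"sample variance           = {sample_variance:.4f}")
print(f"sample standard deviation = {sample_sd:.4f}")

# Cross-check against the library functions.
print(math.isclose(sample_variance, statistics.variance(temps)))
print(math.isclose(sample_sd, statistics.stdev(temps)))
```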


CONCLUSION

In January, 2006 in Purcellville, Virginia, the 30 sampled temperatures fell between 27.7 and

61.7 degrees, for a range of 34 degrees. Temperatures tended to be concentrated in the upper

30’s and low 40’s, as shown in the histogram.

The median temperature is 40.1° and the mean temperature is 41.4°, with standard deviation

8.13°. The temperature data distribution is approximately bell-shaped.

As mentioned at the beginning of this report, January of 2006 seemed to be unusually warm. The

analysis in this project agrees with this conjecture. In looking at the website www.weather.com, I

found that the average daily HIGH temperature for January (in any year) in Purcellville is 42

degrees. My analysis found an average of ALL sampled temperatures (not merely the daily

highs) to be 41.4, not much below the typical daily high. [Remark: The average of the data, 41.4, is a statistic – it is the average temperature for the sample. It is possible that

the average of all January temperature readings is somewhat different. If we were familiar with the techniques of

inferential statistics, we could assess whether we can take this statistic and use it in making a statistical inference.]

FINAL REMARKS: This sample project could have been done without the use of a spreadsheet or fancy software, if the frequency distribution and histogram were carefully hand-drawn or typed. I have added

considerable commentary to the project items, to indicate what I was thinking about when completing the tasks. You

can be less “wordy,” but be sure that your work and summary are detailed and informative, and you show

calculations as requested.