Date science in python

sample-assignment1-201920.html

Home >Computer Science homework help >Date science in python

COMP41680 - Sample API Assignment¶

In [5]:

import os
import urllib.request
import csv
import pandas as pd

Task 1: Identify one or more suitable web APIs¶

API Chosen:

A single API that was chosen for this assignment was that provided by www.worldweatheronline.com

Specifically, the historic weather data API - http://developer.worldweatheronline.com/api/docs/historical-weather-api.aspx

The API is no longer freely available but they give out a free 60 day trial upon signing up, this entitles the user to 500 calls to the API per day.

The API key I received which works here is fbaf429501ff4c7f92b8463217d103

In [3]:

api_key = "fbaf429501ff4c7f92b8463217d103"

Task 2: Collect data your chosen API(s)¶

Collecting Raw Data - Functions needed:

The following 3 functions were written to allow multiple calls of the API as only limited data is available per call.

These function are commented throughout and are called below:

In [4]:

#create a file with set headings - 2 diff types of data to store
def create_file (file_loc, headings): 
    with open(file_loc, "w",newline='') as write_file: #as in get_and_write_data function 
        f = csv.writer(write_file)
        f.writerow(headings)
    write_file.close()
    
#function to call the API, retreive the raw csv data, and write to a file
def get_and_write_data(link, file_loc): 
    response = urllib.request.urlopen(link)
    html = response.read().decode()
    with open(file_loc, "a",newline='') as write_file: #open the file / create it, newline ='' to prevent blank lines being written 
        f = csv.writer(write_file)
        lines = html.strip().split("\n")
        for l in lines:
            if l[0] =="#": # prevent it from writing the comments in the return of each API call
                continue
            elif l[0:10] in ["Not Availa", "There is n"]: #prevent it from writing lines  where no data is present (i.e. returns saying - "Not Available" or "There is no weather data available for the date provided. Past data is available from 1 July, 2008 onwards only.")
                continue
            else: #if doesn't have those it is data and so should be written
                l = l.split(",") #it comes in as a String, so convert to a list for later easier writing and manipulation
                f.writerow(l)
                #print ("Line Written")
    write_file.close()
    #return print ("Monthly Data Appending to Raw File - Completed")

# function to take in parameters set and then use this data to build a link 
# to be passed into the get_and_write_data function
def get_raw_data(file_loc, api_key, location, year, month): #month needs to be a string to avoid invalid token errors for ints as the API needs a leading 0 for single digit months
    while year <=2016: #iterate for all years available in api, namely July 2008 to date
        if month == "02": #need to change end date in the call to the API as it doesn't return full values if the date doesn't exist, e.g. 31st of February
            end_day = "28"
        elif month in ["04", "06", "09", "11"]:
            end_day = "30"
        else: 
            end_day = "31"  
        # the bulding of the link is what decides the data returned, it's available in hourly intervals, 
        # for any location in different formats, the documentation below outlines the possibilities
        # http://developer.worldweatheronline.com/api/docs/historical-weather-api.aspx
        link = "http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key="+ api_key + "&q="+ location +"&format=csv&date="+ str(year) + "-"+ month +"-01&enddate="+ str(year) + "-"+ month +"-"+ str(end_day) +"&tp=24"#
        get_and_write_data(link, file_loc)
        year = year+1

Task 3: Parse the collected data, and store it in an appropriate file format¶

Collecting Raw Data and writing raw data to CSV:

The following code retreives the raw data using the above Functions from the API and writes it to a CSV file.

This data needs extensive cleaning and manipulation before it can be used.

In [98]:

###Set Variable get the raw data from the API and store in the File location set here
location = "Dublin"
raw_file_loc = "weather-data-raw.csv"
create_file (raw_file_loc, " ") # create a file with no headings to store the raw data, no headings needed as the data returns 2 distinct CSV lines with different # of columns

# the api only returns 1 month worth of data at a time 
# so a loop to iterate over all months beginning at Jan is needed
# the API needs the month in 0x format, therefore months 1-9 need to have a 0 added to the front,
# therefore the conversion between int and str was necessary here and a string is passed through to the funtion 
month = 1
while month <= 12: 
    if month <10: 
        month = "0" + str(month)
    else:
        month = str(month)        
    get_raw_data(raw_file_loc, api_key, "location", 2008, month)
    month = int(month)+1
print("Raw Data Collection Completed \n")

Begin Raw Data Collection 

Raw Data Collection Completed

Task 4: Load and represent the data using an appropriate data structure. Apply any pre-processing steps to clean/filter/combine the data¶

Parsing Raw Data:

The raw data returns alternating lines of values, 8 CSVs for each day, and 24 columns of "Hourly data" which is also a daily average as the call to the API has been configured.

These need to be parsed and the data that is to be used later saved, while only 1 data set was needed, I decided to keep and write both sets to different files, for future proofing.

In [ ]:

hourly_file = "weather-data-hourly.csv"
daily_file = "weather-data-daily.csv"
#these are the headings as provided by the API documentation
hourly_headings = ["date","time","tempC","tempF","windspeedMiles","windspeedKmph","winddirdegree","winddir16point","weatherCode","weatherIconUrl","weatherDesc","precipMM","humidity","visibilityKm","pressureMB","cloudcover","HeatIndexC","HeatIndexF","DewPointC","DewPointF","WindChillC","WindChillF","WindGustMiles","WindGustKmph","FeelsLikeC","FeelsLikeF"]
daily_headings = ["date","maxtempC","maxtempF","mintempC","mintempF","sunrise","sunset","moonrise","moonset"]

#call on the function to create the files as needed
create_file(hourly_file, hourly_headings)
create_file(daily_file, daily_headings)

# open the raw data and then based on the length of the line, write to appropriate file
# the len of the lines is actually around 58-62 and over 180, so 100 was chosen for safety, 
# this can be easily changed in future if the API changes
raw_data = open(raw_file_loc, "r")
lines = raw_data.readlines()
for l in lines:
 #   print (len(l))
    if len(l) <= 100: 
        with open(daily_file, "a",newline='') as daily:
            df = csv.writer(daily)
            l = l.split(",")
            df.writerow(l) 
            daily.close()
    elif len(l) >101:
        with open(hourly_file, "a",newline='') as hourly:
            hf = csv.writer(hourly)
            l = l.split(",")
            hf.writerow(l) 
            hourly.close()
    else:
        continue
raw_data.close()

Utilising Pandas and further Data Modification

With the CSV files written these are imported using Pandas.
2 columns were chosen for analysis, namely Temperature and Precipitation for each day
The date field was stored as a String, so this was converted to a Datetime to allow for time analysis.

In [5]:

hourly_data = pd.read_csv(hourly_file)  
daily_data = pd.read_csv(daily_file)

#convert date string to datetime - http://stackoverflow.com/questions/17134716/convert-dataframe-column-type-from-string-to-datetime
pd.options.mode.chained_assignment = None  # default='warn' ## suppress warning regarding A value is trying to be set on a copy of a slice from a DataFrame. - same warning was appearing using a For loop, the index and .loc, and that loop took 5 minutes to run on my machine
hourly_data['date'] = pd.to_datetime(hourly_data['date']) #  removed from to_datetime {, format="YYYY-MM-DD"}
#for i in simplified_data.index:
#    simplified_data.loc[i,'date']=pd.to_datetime(simplified_data.loc[i, 'date'])

simplified_data = hourly_data[["date", "tempC", "precipMM"]] # extract temp and precip data for analysis and visualisation
simplified_data = simplified_data.sort_values(by=['date']) # reorder the data by date

In [101]:

hourly_data[0:5]

Out[101]:

	date	time	tempC	tempF	windspeedMiles	windspeedKmph	winddirdegree	winddir16point	weatherCode	weatherIconUrl	...	HeatIndexC	HeatIndexF	DewPointC	DewPointF	WindChillC	WindChillF	WindGustMiles	WindGustKmph	FeelsLikeC	FeelsLikeF
0	2009-01-01	24	7	44	11	17	94	E	113	http://cdn.worldweatheronline.net/images/wsymb...	...	7	44	2	35	4	38	13	20	4	38
1	2009-01-02	24	6	42	18	28	121	ESE	116	http://cdn.worldweatheronline.net/images/wsymb...	...	6	42	2	35	1	33	22	36	1	33
2	2009-01-03	24	5	41	14	23	172	S	113	http://cdn.worldweatheronline.net/images/wsymb...	...	5	41	-1	31	3	37	7	11	3	37
3	2009-01-04	24	6	43	12	19	314	NW	122	http://cdn.worldweatheronline.net/images/wsymb...	...	6	42	3	37	3	37	10	17	3	37
4	2009-01-05	24	5	42	17	28	100	E	176	http://cdn.worldweatheronline.net/images/wsymb...	...	5	41	1	33	0	33	26	42	0	33

5 rows × 26 columns

In [102]:

simplified_data[0:5]

Out[102]:

	date	tempC	precipMM
1329	2008-07-01	15	15.8
1330	2008-07-02	16	9.9
1331	2008-07-03	14	25.5
1332	2008-07-04	15	4.6
1333	2008-07-05	16	37.5

Missing Data

Final Pre-Processing steps are to look for missing data to see if further pre-processing is needed.

In [138]:

#look for missing data
simplified_data.isnull().sum() # no missing values in the reduced dataset

Out[138]:

date        0
tempC       0
precipMM    0
dtype: int64

In [104]:

simplified_data.dtypes.value_counts()

Out[104]:

float64           1
int64             1
datetime64[ns]    1
dtype: int64

There's no Null's in the data, there's also no strings either, this means there's therefore no values in it such as "Not Available" or for example "No moonrise" in moonrise column, etc.

Both of these are highly indicative that all values are present.

The final Pre-processing step is to get Monthly averages to create a reduced size data set that can be easier visualised, but still accurate and indicative of the months rain and temperature.

In [107]:

monthly = simplified_data.groupby([pd.Grouper(key='date',freq='M')]) # http://stackoverflow.com/questions/32982012/grouping-dataframe-by-custom-date
avg_month = monthly.mean() #create a new DF based on the mean of the groupby object created above
print(avg_month[0:5])

                tempC  precipMM
date                           
2008-07-31  16.419355  8.290323
2008-08-31  17.064516  7.906452
2008-09-30  15.833333  5.186667
2008-10-31  13.032258  5.477419
2008-11-30  10.766667  3.250000

Task 5: Analyse and summarise the cleaned dataset¶

Descriptive Statistics

Initially of the Data Set containing all daily data:

In [110]:

print("\nSimplified_data columnns:\n" + str(simplified_data.columns) + "\n")
print("Simplified_data Descriptive Stats:\n")
print(simplified_data.describe())

Simplified_data columnns:
Index(['date', 'tempC', 'precipMM'], dtype='object')

Simplified_data Descriptive Stats:

             tempC     precipMM
count  2740.000000  2740.000000
mean     12.941241     3.267628
std       4.517523     5.706110
min       2.000000     0.000000
25%      10.000000     0.100000
50%      13.000000     1.100000
75%      16.000000     3.800000
max      25.000000    52.400000

In [111]:

print("Descriptive Stats:\n")
print(avg_month.describe())

Descriptive Stats:

           tempC   precipMM
count  92.000000  92.000000
mean   12.742613   3.327383
std     3.926780   2.076563
min     4.000000   0.100000
25%     9.403226   1.858871
50%    12.768817   2.784516
75%    16.108333   4.209516
max    21.354839  12.500000

As can be seen from comparing both descriptive stats, the monthly average, seems to have removed outliers (e.g. max precipitation 52mm), has reduced the standard deviation, but the quartiles have remained largely the same.

Matplotlib and Pandas Graphing

In [112]:

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Line Graphs and Area Plot¶

In [28]:

plt.figure()
avg_month.plot()
plt.title("Avg Monthly Temperature and Precipitation in Dublin since July 2008\n")
plt.ylabel("Temperature C | Precipitation MM")
plt.xlabel("Time")
plt.show()

<matplotlib.figure.Figure at 0x245e32cd9e8>

In [35]:

avg_month.plot.area(stacked=False)

Out[35]:

<matplotlib.axes._subplots.AxesSubplot at 0x245e5c3d5c0>

Basic Line Graph and Area Plot show how temp and precip interact, as expected Temp increases and falls based on time of year.

Precipitation doesn't seem to follow the same expected trend. It seems the Irish reputation for never ending rain is well deserved, although it would appear to have fallen in recent years.

Stacked Histogram¶

Shows the distribution of the data.

In [33]:

avg_month.plot.hist(stacked=True)

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x245e5a93860>

ScatterPlots¶

Explore the data, look for patterns, outliers, etc.

In [38]:

avg_month.plot.scatter(x="tempC", y="precipMM", s=50 )

Out[38]:

<matplotlib.axes._subplots.AxesSubplot at 0x245e6cb6cf8>

In [114]:

plt.scatter(avg_month['tempC'], avg_month['precipMM'])
plt.show()

In [39]:

from pandas.tools.plotting import scatter_matrix
scatter_matrix(avg_month, alpha=0.2, figsize=(6, 6), diagonal='kde')

Out[39]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6D16278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6DAC5F8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6DF5F98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6E30D30>]], dtype=object)

In [40]:

from pandas.tools.plotting import scatter_matrix
scatter_matrix(daily_data, alpha=0.2, figsize=(6, 6), diagonal='kde')

Out[40]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6C462B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E7606278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E764F470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E76859E8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E76CEB38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E6D7B160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E4FA96D8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E4F505C0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E77666A0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E77B08D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E77ED470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E78354A8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000245E7875C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E78C4278>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E78FDA90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000245E7949CC0>]], dtype=object)

Dual Axis Line Graphs¶

In [54]:

plt.figure()

ax = avg_month.plot(secondary_y=['precipMM'])
ax.set_ylabel("Temperature C")
ax.right_ax.set_ylabel("Precipitation MM")

plt.title("Avg Monthly Temperature and Precipitation in Dublin since July 2008\n")
plt.xlabel("Time")

plt.show()

<matplotlib.figure.Figure at 0x245e8cd2208>

In [55]:

avg_month.plot(subplots=True, figsize=(6, 6));

Final Manipulation, Exploration and Visualisation¶

Temperature v Precipitation¶

For the purposes of this exploration, 2 new Data Frames were created, grouping by Temp and Precipitation and comparing to the mean value of the other, this data was then explored as outlined further below

In [132]:

#x="tempC", y="precipMM"
avg_month_temp = avg_month.groupby("tempC") 
temp_data = avg_month_temp.mean() #create a new DF based on the mean of the groupby object created above
print(temp_data[4:7])

           precipMM
tempC              
4.000000  12.500000
5.333333   0.100000
6.322581   4.441935
6.392857   2.585714
6.903226   2.574194

In [135]:

plt.figure()
avg_month_temp.mean().plot()#secondary_y=['precipMM'])
plt.title("Avg amount (mm) of Precipitation as Temperature increases (Dublin since July 2008)\n")
plt.xlabel("Temperature - C")
plt.ylabel("Precipitation MM")
plt.show()

<matplotlib.figure.Figure at 0x245e039cf98>

In [136]:

#x="tempC", y="precipMM"
avg_month_precip = avg_month.groupby("precipMM") 
precip_data = avg_month_precip.mean() #create a new DF based on the mean of the groupby object created above
print(precip_data[0:1])

              tempC
precipMM           
0.100000   5.333333
0.722581  11.387097
0.733333  19.166667

In [137]:

plt.figure()
avg_month_precip.mean().plot()#secondary_y=['precipMM'])
plt.title("Avg Temperature as amount of Precipitation increases (Dublin since July 2008)\n")
plt.xlabel("Precipitation - MM")
plt.ylabel("Temperature - C")
plt.show()

<matplotlib.figure.Figure at 0x245e0386b38>

Tentative Conclusion¶

Further in-depth studies and tests could be carried out to test the statistical significant of the results, and to incorporate other meterological datasets. However, based on the current data, there does not seem to be a strong relationship between level of rain compared to temperature.

So it doesn't really matter how hot it gets in Dublin, we can still expect rain!