ex3
Data Wrangling Assignment
Tidying up Fields in the Data
So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.
Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.
It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:
>>>
>>> df.get_dtype_counts()
object 6
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:
>>>
>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object
A particular book can have only one date of publication. Therefore, we need to do the following:
· Remove the extra dates in square brackets, wherever present: 1879 [1878]
· Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
· Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
· Convert the string nan to NumPy’s NaN value
Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:
>>>
regex = r'^(\d{4})'
The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.
The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
Let’s see what happens when we run this regex across our dataset:
>>>
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
Name: Date of Publication, dtype: object
Further Reading: Not familiar with regex? You can inspect the expression above at regex101.com and learn all about regular expressions with Regular Expressions: Regexes in Python.
Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:
>>>
>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')
This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:
>>>
>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986
Great! That’s done!
Combining str Methods with NumPy to Clean Columns
Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().
To clean the Place of Publication field, we can combine Pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. It has the following syntax:
>>>
>>> np.where(condition, then, else)
Here, condition is either an array-like object or a Boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise.
Essentially, .where() takes each element in the object used for condition, checks whether that particular element evaluates to True in the context of the condition, and returns an ndarray containing then or else, depending on which applies.
It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions:
>>>
>>> np.where(condition1, x1,
np.where(condition2, x2,
np.where(condition3, x3, ...)))
We’ll be making use of these two functions to clean Place of Publication since this column has string objects. Here are the contents of the column:
>>>
>>> df['Place of Publication'].head(10)
(Insert full-screen snapshot of the answer here)
We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.
Let’s take a look at two specific entries:
>>>
>>> df.loc[4157862]
>>> df.loc[4159587]
(Insert full-screen snapshot of the answer here)
These two books were published in the same place, but one has hyphens in the name of the place while the other does not.
To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
We clean the column as follows:
>>>
>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
(Insert full-screen snapshot of the answer here)
>>> oxford = pub.str.contains('Oxford')
We combine them with np.where:
>>>
df['Place of Publication'] = np.where(london, 'London',
np.where(oxford, 'Oxford',
pub.str.replace('-', ' ')))
>>> df['Place of Publication'].head()
(Insert full-screen snapshot of the answer here)
Here, the np.where function is called in a nested structure, with condition being a Series of Booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or substring in a string).
The replacement to be used is a string representing our desired place of publication. We also replace hyphens with a space with str.replace() and reassign to the column in our DataFrame.
Although there is more dirty data in this dataset, we will discuss only these two columns for now.
Let’s have a look at the first five entries, which look a lot crisper than when we started out:
>>>
>>> df.head()
(Insert full-screen snapshot of the answer here)
Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data; an object dtype is a constant times the length of the data.)
Cleaning the Entire Dataset Using the applymap Function
In certain situations, you will see that the “dirt” is not localized to one column but is more spread out.
There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. Pandas .applymap() method is similar to the in-built map() function and simply applies a function to all the elements in a DataFrame.
Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:
$ head Datasets/univerisity_towns.txt
(Insert full-screen snapshot of the answer here)
We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.
We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:
>>>
>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
... for line in file:
... if '[edit]' in line:
... # Remember this `state` until the next is found
... state = line
... else:
... # Otherwise, we have a city; keep `state` as last-seen
... university_towns.append((state, line))
>>> university_towns[:5]
(Insert full-screen snapshot of the answer here)
We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.
The resulting DataFrame looks like this:
>>>
>>> towns_df = pd.DataFrame(university_towns,
... columns=['State', 'RegionName'])
>>> towns_df.head()
(Insert full-screen snapshot of the answer here)
While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str() methods again here, we could also use applymap() to map a Python callable to each element of the DataFrame.
We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:
>>>
0 1
0 Mock Dataset
1 Python Pandas
2 Real Python
3 NumPy Clean
In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:
>>>
>>> def get_citystate(item):
... if ' (' in item:
... return item[:item.find(' (')]
... elif '[' in item:
... return item[:item.find('[')]
... else:
... return item
Pandas’ .applymap() only takes one parameter, which is the function (callable) that should be applied to each element:
>>>
>>> towns_df = towns_df.applymap(get_citystate)
First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there’s a ( or [ in the element or not.
Depending on the check, values are returned accordingly by the function. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:
>>>
>>> towns_df.head()
(Insert full-screen snapshot of the answer here)
The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It’s that simple!
Technical Detail: While it is a convenient and versatile method, .applymap can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPY (which, in turn, makes calls in C) under the hood.
Renaming Columns and Skipping Rows
Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes.
In that case, we’d want to rename columns and skip certain rows so that we can drill down to necessary information with correct and sensible labels.
To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:
$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Now, we’ll read it into a Pandas DataFrame:
>>>
>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
(Insert full-screen snapshot of the answer here)
This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.
Also, if we were to go to the source of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.
Therefore, we need to do two things:
· Skip one row and set the header as the first (0-indexed) row
· Rename the columns
We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.
This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:
>>>
>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
>>> olympics_df.head()
(Insert full-screen snapshot of the answer here)
We now have the correct row set as the header and all unnecessary rows removed. Take note of how Pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0.
To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).
Let’s start by defining a dictionary that maps current column names (as keys) to more usable ones (the dictionary’s values):
>>>
>>> new_names = {'Unnamed: 0': 'Country',
... '? Summer': 'Summer Olympics',
... '01 !': 'Gold',
... '02 !': 'Silver',
... '03 !': 'Bronze',
... '? Winter': 'Winter Olympics',
... '01 !.1': 'Gold.1',
... '02 !.1': 'Silver.1',
... '03 !.1': 'Bronze.1',
... '? Games': '# Games',
... '01 !.2': 'Gold.2',
... '02 !.2': 'Silver.2',
... '03 !.2': 'Bronze.2'}
We call the rename() function on our object:
>>>
>>> olympics_df.rename(columns=new_names, inplace=True)
Setting inplace to True specifies that our changes be made directly to the object. Let’s see if this checks out:
>>>
>>> olympics_df.head()
(Insert full-screen snapshot of the answer here)