771766: Fundamentals of Data Science.

Fundamentals of Data Science. Workshop Number 2

(11 & 12 /Nov/2021).

Aims of the workshop. We have come a long way! With data cleaning and statistics under our belts, the main focus of this particular workshop will be on the PROJECT that will be due in later in the semester. We note explicitly here that data visualization will be a focus in the coming weeks, but we are now ready to start looking at the project data in more detail.

Workshop Timetable. The timetable below should be taken as indicative only. We will modify the times according to how quickly things progress.

Time          Activity
10:00-10:20   Discussion and Tutorial on the past weeks, and Open Forum.
10:20-10:40   Discussion and Tutorial on the PROJECT talk.
10:40-12:00   Morning Exercises.
12:00-13:00   Lunch Break.
13:00-13:15   Discussion on Afternoon Exercises.
13:15-14:30   Afternoon Exercises.
14:30-14:45   Discussion / Tutorial on Afternoon Exercises.
14:45-15:45   Afternoon Exercises (continued).
15:45-16:00   Wrap up and Reflections.

Useful Information. Throughout this workshop you may find the following useful.

Python Documentation

https://docs.python.org/3.8/

This allows you to look up core language features of Python 3.8, as well as tangential information about the Python language. We refer you back to the Programming module by Ashley Williamson for more notes on this.

Jupyter Notebook Basics

Jupyter itself offers some basic documentation for people new to the editor. This can be found at https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html

Jupyter can also use markdown cells for text input to describe things. If you wish to annotate each cell as to which Exercise it belongs to, you may find https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html useful.

Reminder. We encourage you to discuss the contents of the workshop, and any findings you gather from the session, with the delivery team.

Workshops are not isolated: if you have questions from previous weeks, or about lecture content, please talk to us.

The contents of this workshop are not intended to be 100% completed within the session; as such, it’s expected that some of this work will be completed outside of the session. Exercises herein represent an example of what to do; feel free to expand upon this. We emphasize that in data science many problems are of an “open” nature, and this can make it both interesting and challenging to determine the right route forward. We will tackle this aspect over the next few weeks.

10:00. Discussion (mini-tutorial / study group) on the past weeks.

The past few weeks have revolved around several themes:
(a) fitting data with linear regression
(b) data management and data wrangling (or munging, if you prefer)
(c) statistics

Open Questions:
(a) We have looked at how to handle data in both lists and dataframes (Pandas). Do you prefer operating in lists, dataframes, or perhaps Microsoft Excel? Why? Are there any particular situations in which you could see yourself using one over another?
(b) Which statistic (or statistical test) would be appropriate to apply in what situations? E.g., is a mean always a good statistic to use?
(c) Important Question!: Is there such a thing as a “generic approach” to data cleaning? Why or why not?

Open Forum. How is the MSc for you? Are there any questions or comments that you would like to raise with me as the MSc director? The “buck stops” with me and I will endeavour to follow up on any item raised. One example of this is the new Wednesday workshops, which were requested.

10:20. Discussion (mini-tutorial / study group) on the Project.

We have issued the project since we last spoke together. It has two components: the presentation, and the written component. I want to spend some time reviewing the presentation aspect in this part of our discussion.
(a) What makes a “good” presentation?
(b) What is a good “balance” on any PowerPoint slides that you would think about using?
(c) What sort(s) of information do you think you should put into the presentation?
(d) How should we break up a talk (i.e. subsections) to best put across the information?

10:40. Morning Exercises.

In these morning exercises, I want to try to provide some guidance on the analysis of the project itself. Clearly, each of you will have your own dataset to use for the project, and so the exercises here might produce results that differ from those of others taking this module.

Exercise 1. Coding. The first step of the project is to read in the csv file. Hopefully this step will now be straightforward given that we’ve been reading in csv files for a few weeks now. Use pandas to do this.
Tip 1: You might like to keep the original csv file and operate on a copy of it.
Tip 2: You might like to write out updated csv files as you perform changes to it. Label them something sensible, of course.

Exercise 2a. Science and Coding. We need to do a “first look” at the data. You might have already done this step before today, but I strongly recommend that you have a good first glance at the data before operating on it. We could do this in Excel itself if we wanted to (no worries from me at this stage!). But as data sets grow increasingly large, this becomes a really (really!) bad way to do things. Instead, I would advise that you either (a) use one of the EDA packages that we highlighted in the consolidation exercises from Week 6, or (b) use .unique() commands in Python to deduce what exists in each column (also called a “feature”). Also use .isnull() to see if there are nulls in there too.

Exercise 2b. Make notes (or make a logbook) on everything that you see in the csv file that you think will need to be cleaned. Keep these notes and update them as you make changes to the original csv file. Use it as a “check list”. Here are some of the more obvious things to look out for:
What blanks are there in the data set? How many? Which columns or rows?
What inconsistencies are there inside a single column? (integers versus floats, etc.)
Are entries in a single format, with the same number of words you would expect per entry?

Exercise 3. Coding: Cleaning. (Focus on Age and Religion). We need to start our cleaning of the data. For each change you make, note it in your logbook / notes of things you have cleaned. Justify to yourself any replacements that you make and note these in your logbook. Some of the easier changes will be changing floats to integers for ages.
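The first-look and type-fixing steps from Exercises 1 to 3 can be sketched as follows. This is only a minimal sketch, not the required solution: the column names `Age` and `Religion` and the tiny inline table are assumptions standing in for your own census file, and rounding fractional ages down is one possible choice that you would need to justify in your logbook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a project csv file. With your own data, pass the
# path of a copy of your file to pd.read_csv instead of this in-memory text.
raw_csv = io.StringIO(
    "Age,Religion\n"
    "34.0,Christian\n"
    "7,Catholic\n"
    ",Christian\n"
    "61.5,\n"
)

df = pd.read_csv(raw_csv)
working = df.copy()  # operate on a copy and keep the original untouched (Tip 1)

# First look: which distinct values appear in each column, and how many blanks?
unique_values = {col: working[col].unique() for col in working.columns}
null_counts = working.isnull().sum()

# Floats to integers for ages: round down to completed years, then use the
# nullable integer dtype so blanks stay as <NA> until they are imputed.
working["Age"] = np.floor(working["Age"]).astype("Int64")

print(unique_values)
print(null_counts)
print(working)
```

Writing the result out under a sensible name (Tip 2) would then be a call such as `working.to_csv(...)` with a filename of your choosing.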

Some of the more challenging changes to make will be to impute new values for the blanks. I would suggest that sensible entries can be made for almost everything in this dataset, with the possible exception of religion. You need only make some progress with this, rather than complete it, as the later exercises will focus on different columns. If you feel you need to go back (maybe because you discover something new that needs cleaning), that is a good discovery and something that you should certainly undertake. But to undertake the next few exercises, focus firstly on cleaning the age and religion columns. Make everything an integer for age (and at minimum impute the median age as the basic option for any blanks), and for religion check there are no blank entries (and if there are, change them to something sensible – perhaps the “mode” religion?).

Exercise 4. Coding and Science: Lies! This one is a little bit tougher. I want you to consider if anyone in your census return is lying! There are a variety of reasons (historically, and in the present day) that people lie on censuses. They may lie about their age so that they can be classed as a voter, perhaps. They may lie about other things just to show their opposition to the government (or they’re upset at being set “homework” by the authorities). Detecting these lies is not an easy exercise, but we can make progress. Let’s look firstly at Religion. How frequently do different religions appear? Can you make a histogram (bar chart) of the different types of religion from your cleaned data? Is anyone lying about their religious affiliation? How can you tell? Look especially at low frequency religions. How many of them are “real”, and how many do you think are “lies”? Assuming you have identified some “lies” here, the next question is what to do about them. There are several options:

(a) We could impute a different religion if we are confident that we can identify the “truth”.
(b) We could replace it with a “null” (or disregard it) for our later analysis.
(c) We could retain it (we need to note why we would do so – what benefit is there in this?).

Whichever option you select, make a note of it in your logbook.

Exercise 5. Coding and Science: More Lies! We’ve now had a little look at religion. Let’s turn our attention to ages. Is anyone lying about their age? How could we tell? Let’s firstly look at whether anyone is older than (say) 122 (roughly the oldest that anyone has ever lived). If there is anyone in that category, they have either lied, or simply can’t recall their age correctly.
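The basic imputation and the first plausibility checks described above might look something like the sketch below. The records are made up purely for illustration, the column names `Age` and `Religion` are assumptions, and the "appears only once" threshold for a rare religion is an arbitrary choice rather than a rule.

```python
import pandas as pd

# Made-up records for illustration only; column names are assumptions.
census = pd.DataFrame({
    "Age": pd.array([34, 7, None, 61, 130, 45], dtype="Int64"),
    "Religion": ["Christian", "Christian", None, "Catholic",
                 "Jedi Knight", "Christian"],
})

# Basic option for blanks: median age and modal ("mode") religion.
median_age = int(census["Age"].median())
census["Age"] = census["Age"].fillna(median_age)
modal_religion = census["Religion"].mode().iloc[0]
census["Religion"] = census["Religion"].fillna(modal_religion)

# How frequently does each religion appear? Rare entries are candidates for
# a closer look by eye; rarity alone does not make an entry a "lie".
religion_counts = census["Religion"].value_counts()
rare_religions = religion_counts[religion_counts <= 1]

# Ages above roughly the oldest documented lifespan, or below zero.
implausible_age = census[(census["Age"] > 122) | (census["Age"] < 0)]

print(religion_counts)
print(rare_religions)
print(implausible_age)
```

Calling `religion_counts.plot(kind="bar")` would draw the corresponding bar chart. What to do with the flagged rows (impute, null, or retain) remains a judgement to record in your logbook.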

Does anyone have a negative age? Identify them as well. Do you think there are “underage” mothers or fathers lying about their ages? Do you think there are any children whose parents have inflated their age (so they can enter the workforce) or deflated their age (so they attend school and parents potentially gain government support cash)? Are there “blank” (.isnull()) entries that you can sensibly replace (impute) with reasonable values?

13:00 (and 14:30). Afternoon Discussion.

The first discussion this afternoon is to think about what the “key diagrams” might be to have (or to produce). What diagrams will you need to address the goals of the project?

13:15. Afternoon Exercises.

Arguably the most obvious diagram we need is one that shows the ages of the population. We will look at Visualization in detail in our lectures next week. Hence the learning here serves both as an introduction to that, and as a tailored set of exercises directed to the project.

Exercise A. Make a histogram (using either matplotlib or seaborn – take your pick!) of the ages of the population. Take your time and choose “sensible” bin widths for the histogram. (How did you decide this? Make a post in the chat and compare notes with others!)

Exercise B. We want a bit more detail than just the overall age. Make two further histograms: one for “male” and one for “female” ages. Once again, consider what bin width to use here.

Exercise C(i). Demographers often talk about an “age pyramid” to characterize ages versus gender. Let’s have a go at making one ourselves. To do this, we are going to use some REAL data from the UK. You will need the 2 files from Canvas that are called: 2019estimatesMale.csv and 2019estimatesFemale.csv [Both of these files are estimated projections from the Office for National Statistics for 2019.] Firstly, we will need to read them both in and check (in Pandas maybe) that they are all good.
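For Exercises A and B, one possible starting point with matplotlib is sketched below. The ages are randomly generated placeholders, the column names `Age` and `Gender` are assumptions, and the 5-year bin width is just one choice to experiment with; sharing the same bin edges across all three panels keeps them directly comparable.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated placeholder population (not real census data).
rng = np.random.default_rng(0)
people = pd.DataFrame({
    "Age": rng.integers(0, 100, size=500),
    "Gender": rng.choice(["Male", "Female"], size=500),
})

# Explicit bin edges: 0, 5, 10, ... up to just past the oldest person.
bin_width = 5
bins = np.arange(0, people["Age"].max() + bin_width + 1, bin_width)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)
axes[0].hist(people["Age"], bins=bins)
axes[0].set_title("All")
for ax, gender in zip(axes[1:], ["Male", "Female"]):
    ax.hist(people.loc[people["Gender"] == gender, "Age"], bins=bins)
    ax.set_title(gender)
for ax in axes:
    ax.set_xlabel("Age")
axes[0].set_ylabel("Number of people")
fig.tight_layout()
```

Changing `bin_width` to 1 or 10 and redrawing is a quick way to see how much the choice affects what the histogram appears to say.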

Let’s try to put the data in the format that we need it to be. First step: multiply all “male” counts by -1 (i.e. so that they are negative numbers). The reason for doing this will become obvious very soon.

Exercise C(ii). Import the following into your code in addition to Pandas:

    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np

We will also need the male and female data in a single dataframe (this is not 100% necessary, but will help with the next task). Create a new dataframe in Pandas that has the format: Age Group, (negative) number of males, number of females. (Note that the negative value is what you did in C(i).) You should end up with a dataframe that is similar in format to the following (NB: feel free to use the following as an example in part C(iii) initially if you are having problems creating the histogram data from the csv files.)

    age_p = pd.DataFrame({
        'Age': ['100+', '90-99', '80-89', '70-79', '60-69', '50-59', '40-49', '30-39', '20-29', '10-19', '0-9'],
        'Male': [-6, -17, -92, -180, -290, -427, -544, -506, -545, -515, -427],
        'Female': [4, 15, 76, 176, 314, 487, 634, 642, 579, 535, 447]})
    AgeClass = ['100+', '90-99', '80-89', '70-79', '60-69', '50-59', '40-49', '30-39', '20-29', '10-19', '0-9']

Exercise C(iii). Let’s plot our age pyramid now using a bit of seaborn barplotting. Try this:

    age_pyramid = sns.barplot(x='Male', y='Age', data=age_p, order=AgeClass, color='mediumblue', label='Male')
    age_pyramid = sns.barplot(x='Female', y='Age', data=age_p, order=AgeClass, color='darkorange', label='Female')
    age_pyramid.legend()
    plt.title('Age Pyramid')

Here ‘Male’ is the name of the column that contains the number of males whose age falls in the group stored in the column ‘Age’ (and similarly for ‘Female’). You might need to do some wrangling to get the data into this format.
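One way the wrangling could go is sketched below. It assumes, hypothetically, that each file holds one row per single year of age with columns named `Age` and `Count`; the real files may be laid out differently, so check their columns first. The numbers are made up and only three age groups are used to keep the example short.

```python
import pandas as pd

# Made-up single-year counts standing in for the two csv files.
male = pd.DataFrame({"Age": [0, 5, 12, 19, 23], "Count": [10, 12, 9, 11, 8]})
female = pd.DataFrame({"Age": [0, 5, 12, 19, 23], "Count": [9, 13, 10, 10, 7]})

edges = [0, 10, 20, 30]              # groups are [0, 10), [10, 20), [20, 30)
labels = ["0-9", "10-19", "20-29"]


def to_groups(counts):
    """Sum single-year counts into the labelled age groups."""
    groups = pd.cut(counts["Age"], bins=edges, labels=labels, right=False)
    return counts.groupby(groups, observed=False)["Count"].sum()


age_p = pd.DataFrame({
    "Male": -to_groups(male),        # negative so male bars extend to the left
    "Female": to_groups(female),
}).reset_index()

# Oldest group first, matching the order used for the pyramid.
age_p = age_p.iloc[::-1].reset_index(drop=True)
AgeClass = labels[::-1]

print(age_p)
```

The same `pd.cut` step applies when you start from one row per person rather than one row per age: bin the ages, then count rows per group and gender instead of summing a `Count` column.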

Let’s also do some labelling:

    age_pyramid.set(xlabel='Population Count', ylabel='Age Group')

Exercise C(iv). Evaluate how your plot looks. Does it communicate the information well? Are there any improvements you can think of? We will talk about C(iv) in detail in our 14:30 discussion.

Exercise D. Produce an age population pyramid for your Project Data. NB: This exercise assumes that you have cleaned your ages from this morning. If not, simply use a “dropna” command to get rid of the missing data so that you can at least plot it. You can return to the cleaned data later. You will need to bin up your data, and this may take a bit of time to do / think about! Feel free to post your solution to the Teams general channel. Look once again at the example dataframe given in C(ii) to help with this.

Exercise E. Science. Does the age pyramid from the project look anything like the age pyramid from the government? What are the differences (if any)? Why do you think those differences arise, and are they significant?

Exercise F, and Extension Work. Science and coding. We will have to make some kind of evaluation about “commuters” inside the Project data set. By “commuter” we mean someone who might live in the town that the census has been taken in, but works in one of the nearby cities, rather than within the town itself. Question for discussion / chat: how would you identify a “commuter”? This is not an easy question, as the data given didn’t record this in any systematic way. (Aside: modern era censuses directly ask this.) But could we infer the existence of commuters somehow? Are there certain classes of people in the census who have to be commuters? Are there certain types of people who are more likely than others to be commuters? Why? Justify your answer.

Once you have decided how you might infer who the commuters are, try to determine how many such commuters exist in the data set by finding people who match your criteria. As a fraction of the population, then, how many people work outside the town and therefore need to travel significant distances for work? This exercise is clearly fraught with assumptions and guesswork. But can you make a rigorous argument about it?
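Once you have a criterion, counting the matches and turning them into a fraction is mechanical. The sketch below shows only that mechanical step: the records are made up, `Occupation` is an assumed column name, and the keyword list is a placeholder criterion (it presumes, hypothetically, that the town has no university or airport), not a recommended answer.

```python
import pandas as pd

# Made-up records; `Occupation` is an assumed column name.
town = pd.DataFrame({
    "Occupation": [
        "University Student", "Farmer", "Retired teacher", "Airline pilot",
        "Child", "University Student", "Shop assistant", "Unemployed",
    ],
})

# Placeholder criterion: occupations assumed impossible to carry out in town.
# Replace this list with whatever criteria you can justify.
commuter_keywords = ["university student", "airline pilot"]
pattern = "|".join(commuter_keywords)

# Case-insensitive match; blank occupations are treated as non-commuters.
is_commuter = town["Occupation"].str.contains(pattern, case=False, na=False)

n_commuters = int(is_commuter.sum())
commuter_fraction = is_commuter.mean()

print(n_commuters, commuter_fraction)
```

Keeping the criterion in one clearly named list makes it easy to state your assumptions in the write-up and to test how sensitive the fraction is to each of them.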