Example Project#
This notebook is a longer exercise for students to complete in the third session of Python for Data Analysis. It has been designed to try to replicate the process of starting a new data analysis project.
The data set used in this example is The Database of British and Irish Hills v18, which is freely available under a Creative Commons Attribution 4 License at https://www.hills-database.co.uk/downloads.html. This data set contains grid reference information for peaks, hills, and cols in Britain.
Step 1: Project set up#
Create a new folder on your local machine in which your new data analysis project will be stored.
Create a virtual environment for the project so that we can install packages into it. Note that you will need to use a shell or terminal window for this.
Install Pandas into your virtual environment (if not already present), again using the shell.
Create a new notebook for your analysis (feel free to copy this one).
Import Pandas to ensure you have installed it correctly.
# Your Step 1 Code
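A minimal sketch of the import check, assuming the virtual environment from this step is the one your notebook is running in:

```python
import pandas as pd

# If the import above raises ModuleNotFoundError, Pandas is not installed
# in the environment that the notebook kernel is using.
print(pd.__version__)
```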
Step 2: Get and load a data set#
Import Pandas into your analysis notebook.
Find a .CSV file online and save it into your project folder. Ideally, save this in a new folder called ‘data’ within your project folder to stay organised. You are welcome to use the hills data set above if desired.
Load the data into a DataFrame using Pandas.
# Your Step 2 Code
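One way this step might look is sketched below. The temporary folder, the file name example_hills.csv, and the three columns are all made up so that the sketch runs on its own; in your project you would point pd.read_csv at the real file in your 'data' folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for your project folder (hypothetical; use your real folder instead).
project_dir = Path(tempfile.mkdtemp())
data_dir = project_dir / "data"
data_dir.mkdir()

# Hypothetical file name and contents, written only to make the sketch runnable.
csv_path = data_dir / "example_hills.csv"
csv_path.write_text(
    "Name,Country,Metres\n"
    "Hill A,S,1200.0\n"
    "Hill B,E,950.5\n"
    "Hill C,W,1050.0\n"
)

# Load the CSV file into a DataFrame.
df = pd.read_csv(csv_path)
print(df)
```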
Step 3: Explore your data set#
Read about the df.info() method in the Pandas documentation. What does it do?
Use it to get some information about your DataFrame. Read the output carefully. What does it tell you?
# Your Step 3 Code
Use df.describe() to get more information about your DataFrame. What is the difference between this and df.info()?
# Your Step 3 Code
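As a rough illustration of the difference, the sketch below uses a small invented DataFrame (the column names and values are assumptions, not the real hills data):

```python
import pandas as pd

# Invented example data; one height is missing on purpose.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["S", "S", "E", "W"],
    "Metres": [1200.0, 800.0, 950.5, None],
})

# info() prints a structural summary: index, column names,
# non-null counts, and data types.
df.info()

# describe() returns summary statistics; by default it only
# includes the numeric columns.
summary = df.describe()
print(summary)
```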
Print the names of the columns of your data set.
Extra: How can we turn this into a list, just using Pandas?
# Your Step 3 Code
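A possible approach, shown on an invented DataFrame with hypothetical column names:

```python
import pandas as pd

# Invented example data.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B"],
    "Country": ["S", "E"],
    "Metres": [1200.0, 950.5],
})

# df.columns is a Pandas Index object holding the column names.
print(df.columns)

# tolist() converts the Index into a plain Python list.
column_names = df.columns.tolist()
print(column_names)
```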
What is the other axis of a DataFrame? Print these values.
# Your Step 3 Code
Use the df.shape attribute to get information about the shape of your DataFrame. What does this output mean? What is the data type returned? How can we access these values directly?
# Your Step 3 Code
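A short sketch on invented data. Note that shape is an attribute rather than a method, so it is accessed without parentheses:

```python
import pandas as pd

# Invented example data: 4 rows and 3 columns.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["S", "S", "E", "W"],
    "Metres": [1200.0, 800.0, 950.5, 1050.0],
})

# shape is a tuple of (number of rows, number of columns).
print(df.shape)

# The tuple can be unpacked to access each value directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```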
Read about the df.head() and df.tail() methods in the Pandas documentation. What do they do?
Use these methods with n=10 on your DataFrame. What is n?
In separate cells, try different values of n.
# Your Step 3 Code
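The sketch below tries both methods on an invented DataFrame of 12 rows (the names and heights are made up):

```python
import pandas as pd

# Invented example data with 12 rows.
df = pd.DataFrame({
    "Name": [f"Hill {i}" for i in range(12)],
    "Metres": [600 + 50 * i for i in range(12)],
})

# head(n) returns the first n rows of the DataFrame.
first_ten = df.head(n=10)
print(first_ten)

# tail(n) returns the last n rows of the DataFrame.
last_three = df.tail(n=3)
print(last_three)
```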
Explore the DataFrame columns. This will look different for each of you. Do not perform any boolean indexing/filtering yet.
# Your Step 3 Code
In the hills data set example:
Identify the unique countries that the hills belong to in the DataFrame
Identify the regions of hills in the DataFrame
Identify the min, max, and mean height of all hills
# Your Step 3 Code
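One way to approach these three tasks is sketched below. The column names Country, Region, and Metres, and all of the values, are assumptions for illustration; check the column names in your own DataFrame first.

```python
import pandas as pd

# Invented example data; the column names are assumptions.
df = pd.DataFrame({
    "Country": ["S", "S", "E", "W"],
    "Region": ["Region 1", "Region 2", "Region 3", "Region 3"],
    "Metres": [1200.0, 800.0, 950.0, 1050.0],
})

# unique() returns each distinct value once, in order of first appearance.
countries = df["Country"].unique()
regions = df["Region"].unique()
print(countries)
print(regions)

# agg() computes several statistics on a column in one call.
stats = df["Metres"].agg(["min", "max", "mean"])
print(stats)
```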
This is strange: why are there 7 countries in the Country column? Let’s investigate this later.
# Your Step 3 Code
Step 4: Simplify your DataFrame#
It is common to drop columns from a DataFrame that are not useful to you.
Drop a column that looks unimportant in your DataFrame. Don't worry, this can be retrieved by reloading the data from the .csv file.
# Your Step 4 Code
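A minimal sketch, assuming a hypothetical Feet column that duplicates the information in Metres:

```python
import pandas as pd

# Invented example data; "Feet" is the column we decide we do not need.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B"],
    "Metres": [1200.0, 950.5],
    "Feet": [3937.0, 3118.0],
})

# drop() returns a new DataFrame; the original df is left unchanged.
df_small = df.drop(columns=["Feet"])
print(df_small.columns.tolist())
```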
It is in this step that you would perform any data cleaning required: this broad term might include deciding how to handle NaN values, filtering out any rows with other corrupted data, or applying some transformations.
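As one example of handling NaN values, the sketch below counts and then removes rows with a missing height. The data is invented, and dropping rows is only one of several reasonable choices.

```python
import pandas as pd

# Invented example data with one missing height.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C"],
    "Metres": [1200.0, None, 950.5],
})

# Count the missing values in the column.
n_missing = df["Metres"].isna().sum()
print(n_missing)

# Keep only the rows where the height is present.
cleaned = df.dropna(subset=["Metres"])
print(cleaned)
```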
Step 5: Selecting slices of data#
Select all rows with a particular value of a categorical variable.
# Your Step 5 Code
In the hills data set example, we will select all hills from Scotland
# Your Step 5 Code
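A sketch of this selection using boolean indexing. It assumes that Scotland is stored as the code "S" in a Country column; check the actual values in your data before relying on this.

```python
import pandas as pd

# Invented example data; the country codes are assumptions.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["S", "E", "S", "W"],
    "Metres": [1200.0, 950.5, 800.0, 1050.0],
})

# The comparison produces a boolean Series, which is then used
# to keep only the rows where it is True.
is_scottish = df["Country"] == "S"
scottish_hills = df[is_scottish]
print(scottish_hills)
```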
Now select based on two categorical variables, and create a statistic
In the hills data set example, answer the question:
"What is the median hill height of hills not in Scotland and Wales?"
# Your Step 5 Code
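The sketch below reads the question as "hills in neither Scotland nor Wales". The country codes "S" and "W" and all of the data are invented for illustration.

```python
import pandas as pd

# Invented example data; the country codes are assumptions.
df = pd.DataFrame({
    "Country": ["S", "S", "E", "E", "E", "W"],
    "Metres": [1200.0, 800.0, 900.0, 700.0, 1000.0, 1050.0],
})

# isin() marks rows whose country is in the list; ~ inverts the mask.
not_scotland_or_wales = ~df["Country"].isin(["S", "W"])

# Select the heights of the remaining hills and take the median.
median_height = df.loc[not_scotland_or_wales, "Metres"].median()
print(median_height)
```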
Let’s answer another question:
"Which country has the highest mean hill height?"
# Your Step 5 Code
However, we can do this in a better way, using Pandas' built-in functions.
# Your Step 5 Code
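To illustrate the contrast, the sketch below answers the question twice on invented data: first with a manual loop, then with groupby, which is one of the built-in options for this kind of task.

```python
import pandas as pd

# Invented example data; the country codes are assumptions.
df = pd.DataFrame({
    "Country": ["S", "S", "E", "E", "W"],
    "Metres": [1200.0, 800.0, 900.0, 700.0, 1050.0],
})

# Manual approach: loop over each country and compute its mean height.
manual_means = {}
for country in df["Country"].unique():
    manual_means[country] = df.loc[df["Country"] == country, "Metres"].mean()
print(manual_means)

# Built-in approach: groupby computes the mean for every country at once.
means = df.groupby("Country")["Metres"].mean()
print(means)

# idxmax() returns the index label (the country) with the largest mean.
highest = means.idxmax()
print(highest)
```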
Two final questions:
"What percentage of hills in the data set are above 1000 meters in height?"
"What are the names of the tallest 5 hills?"
# Your Step 5 Code
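A possible approach to both questions, using invented names and heights:

```python
import pandas as pd

# Invented example data.
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D",
             "Hill E", "Hill F", "Hill G", "Hill H"],
    "Metres": [978.0, 1345.0, 600.0, 1296.0,
               1309.0, 950.0, 1258.0, 1291.0],
})

# The mean of a boolean Series is the fraction of True values.
percent_above_1000 = (df["Metres"] > 1000).mean() * 100
print(percent_above_1000)

# nlargest() returns the rows with the largest values in a column.
tallest_five = df.nlargest(5, "Metres")["Name"].tolist()
print(tallest_five)
```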
Most of the difficulty is knowing which Pandas functions to use! With small data sets, aspects like computational speed do not matter as much as with large data sets. For larger data sets, using Pandas' built-in methods as much as possible is an easy way to help your code run quickly.
Step 6: Plotting with Pandas and Matplotlib#
Use the built-in Pandas plotting functions to plot an aspect of your data.
Pandas plotting functions are based on the Matplotlib API. These are mostly a 1:1 mapping, but there are some differences.
In general, Pandas plotting functions can be useful for quickly creating plots. For detailed customisation, plotting with Matplotlib directly might save time.
Tips:
First install Matplotlib in your virtual environment: Pandas needs access to this.
In the hills example, let’s plot the number of hills in our data set.
# Your Step 6 Code
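The sketch below interprets this as a bar chart of the number of hills per country, which is an assumption. The data is invented, and Matplotlib must be installed for the plot call to work.

```python
import pandas as pd

# Invented example data; the country codes are assumptions.
df = pd.DataFrame({
    "Country": ["S", "S", "S", "E", "E", "W"],
})

# value_counts() counts how many rows there are for each country.
counts = df["Country"].value_counts()
print(counts)

# Pandas draws the bar chart with Matplotlib and returns the Axes object.
ax = counts.plot(kind="bar")
ax.set_xlabel("Country")
ax.set_ylabel("Number of hills")
```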
In the hills example, let’s plot the number of hills above or equal to a threshold height.
# Your Step 6 Code
In the hills example, we have some lat, lon data. Let’s plot this using a scatter plot.
# Your Step 6 Code
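A minimal scatter plot sketch, assuming hypothetical Latitude and Longitude columns and invented coordinates:

```python
import pandas as pd

# Invented example coordinates; the column names are assumptions.
df = pd.DataFrame({
    "Latitude": [56.8, 54.5, 53.1, 57.1],
    "Longitude": [-5.0, -3.2, -4.1, -3.7],
})

# Longitude on the x-axis and latitude on the y-axis gives a map-like layout.
ax = df.plot.scatter(x="Longitude", y="Latitude")
```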
In the hills example, let’s colour the points by country.
# Your Step 6 Code
To add a legend, things are getting sufficiently complicated that moving to a more verbose Matplotlib plotting structure is helpful.
In the hills example, we will plot in a loop.
# Your Step 6 Code
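One way to structure the loop is sketched below, calling Matplotlib directly so that each country gets its own legend entry. The column names, country codes, and coordinates are all invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented example data; the column names and codes are assumptions.
df = pd.DataFrame({
    "Country": ["S", "S", "E", "W", "E"],
    "Latitude": [56.8, 57.1, 54.5, 53.1, 54.4],
    "Longitude": [-5.0, -3.7, -3.2, -4.1, -3.0],
})

fig, ax = plt.subplots()

# groupby yields one (country, sub-DataFrame) pair per country;
# each group is drawn as a separate scatter series with its own label.
for country, group in df.groupby("Country"):
    ax.scatter(group["Longitude"], group["Latitude"], label=country)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(title="Country")
```

Plotting each group separately lets Matplotlib pick a different colour per country and build the legend from the labels.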