Example Project

This notebook is a longer exercise for students to complete in the third session of Python for Data Analysis. It has been designed to try to replicate the process of starting a new data analysis project.

The data set used in this example is The Database of British and Irish Hills v18, which is freely available under a Creative Commons Attribution 4 License at https://www.hills-database.co.uk/downloads.html. This data set contains grid reference information for peaks, hills, and cols in Britain.

Step 1: Project set up

  • Create a new folder on your local machine in which your new data analysis project will be stored.

  • Create a virtual environment for the project so that we can install packages into it. Note that you will need to use a shell or terminal window for this.

  • Install Pandas into your virtual environment (if not already present), again using the shell.

  • Create a new notebook for your analysis (feel free to copy this one).

  • Import Pandas to ensure you have installed it correctly.

# Your Step 1 Code 
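As a minimal sketch of the final check in this step: the shell commands in the comments are one common way to set up an environment (the folder name `.venv` is an assumption, and your own setup may differ), and the Python lines simply confirm that Pandas can be imported.

```python
# In a shell, one common way to create and activate an environment is:
#   python -m venv .venv
#   source .venv/bin/activate        (on Windows: .venv\Scripts\activate)
#   python -m pip install pandas
# The folder name ".venv" is only an assumption; any name works.

# Then, in the notebook:
import pandas as pd

# If this prints a version number, Pandas is installed and importable
print(pd.__version__)
```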

Step 2: Get and load a data set

  • Import Pandas into your analysis notebook.

  • Find a .CSV file online and save it into your project folder. Ideally, save this in a new folder called ‘data’ within your project folder to stay organised. You are welcome to use the hills data set above if desired.

  • Load the data into a DataFrame using Pandas.

# Your Step 2 Code 
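The loading step might look something like the sketch below. So that it runs without any file on disk, it reads from an in-memory string standing in for a CSV file; the column names and values are made up for illustration, and the path `data/hills.csv` in the comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (made-up values, assumed column names)
csv_text = """Name,Country,Metres
Hill A,Scotland,1200.0
Hill B,England,900.0
Hill C,Wales,1050.0
"""

# With a real file you would pass a path instead,
# e.g. pd.read_csv("data/hills.csv") (hypothetical file name)
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```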

Step 3: Explore your data set

  • Read about the df.info() method in the Pandas documentation. What does it do?

  • Use it to get some information about your DataFrame. Read the output carefully. What does it tell you?

# Your Step 3 Code 
  • Use df.describe() to get more information about your DataFrame. What is the difference between this and df.info()?

# Your Step 3 Code 
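To compare the two methods side by side, here is a small sketch on a toy DataFrame. The column names and values are invented for illustration and are not taken from the hills data set.

```python
import pandas as pd

# Toy data with made-up values; the column names are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["Scotland", "England", "Wales", "Scotland"],
    "Metres": [1200.0, 900.0, 1050.0, None],
})

# info() prints the column names, non-null counts, dtypes and memory usage
df.info()

# describe() returns summary statistics, by default for numeric columns only
summary = df.describe()
print(summary)
```

Notice that `info()` describes the structure of the DataFrame, while `describe()` summarises the values it contains and skips missing entries when counting.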
  • Print the names of the columns of your data set.

  • Extra: How can we turn this into a list, just using Pandas?

# Your Step 3 Code 
  • What is the other axis of a DataFrame? Print these values.

# Your Step 3 Code 
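One possible way to inspect both axes is sketched below, again on an invented toy DataFrame. It assumes the default integer row labels that Pandas assigns when no index is specified.

```python
import pandas as pd

# Toy data with made-up values; the column names are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C"],
    "Metres": [1200.0, 900.0, 1050.0],
})

# The column labels are stored in an Index object
print(df.columns)

# tolist() converts that Index into a plain Python list
column_names = df.columns.tolist()

# The other axis is the row index (a RangeIndex by default)
print(df.index)
row_labels = df.index.tolist()
```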
  • Use the df.shape attribute to get information about the shape of your DataFrame (note that it is an attribute, not a method, so it takes no parentheses). What does this output mean? What data type is returned?

  • How can we access these values directly?

# Your Step 3 Code 
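A short sketch of reading the shape, using an invented toy DataFrame. Note that `shape` is an attribute rather than a method, so it is written without parentheses.

```python
import pandas as pd

# Toy data with made-up values; the column names are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C"],
    "Metres": [1200.0, 900.0, 1050.0],
})

# shape is a tuple of (number of rows, number of columns)
rows_cols = df.shape
print(rows_cols, type(rows_cols))

# The values can be accessed by position or by unpacking the tuple
n_rows = df.shape[0]
n_rows, n_cols = df.shape
```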
  • Read about the df.head() and df.tail() methods in the Pandas documentation. What do they do?

  • Use these methods with n=10 on your DataFrame. What is n?

  • In separate cells, try different values of n.

# Your Step 3 Code 
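The following sketch shows how `n` controls the number of rows returned. It uses a toy DataFrame with a single made-up column of 20 values.

```python
import pandas as pd

# Toy data: 20 rows with made-up heights from 100 to 2000
df = pd.DataFrame({"Metres": list(range(100, 2100, 100))})

# n is the number of rows to return; it defaults to 5
first_five = df.head()
first_ten = df.head(n=10)
last_three = df.tail(3)

print(first_ten)
print(last_three)
```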
  • Explore the DataFrame columns. This will look different for each of you. Do not perform any boolean indexing/filtering yet.

# Your Step 3 Code 
  • In the hills data set example:

    • Identify the unique countries that the hills belong to in the DataFrame

    • Identify the regions of hills in the DataFrame

    • Identify the min, max, and mean height of all hills

# Your Step 3 Code 
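The three tasks above could be approached as in the sketch below. The column names `Country`, `Region` and `Metres` are assumptions about how the hills file is laid out, and all values here are invented, so check `df.columns` in your own notebook first.

```python
import pandas as pd

# Toy data with made-up values; the column names are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["Scotland", "England", "Wales", "Scotland"],
    "Region": ["Region 1", "Region 2", "Region 3", "Region 1"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0],
})

# unique() returns the distinct values in a column
countries = df["Country"].unique()
regions = df["Region"].unique()

# agg() computes several statistics in one call
height_stats = df["Metres"].agg(["min", "max", "mean"])

print(countries)
print(regions)
print(height_stats)
```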
  • This is strange: why are there 7 countries in the Country column? Let’s investigate this later.

# Your Step 3 Code 

Step 4: Simplify your DataFrame

It is common to drop columns from a DataFrame that are not useful to you.

  • Drop a column that looks unimportant in your DataFrame. Don't worry, this can be retrieved by reloading the data from the .csv file.

# Your Step 4 Code 
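Dropping a column might look like the sketch below. The `Feet` column is a hypothetical example of redundant information, and the data is invented.

```python
import pandas as pd

# Toy data with made-up values; "Feet" is a hypothetical redundant column
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B"],
    "Metres": [1200.0, 900.0],
    "Feet": [3937.0, 2953.0],
})

# drop() returns a new DataFrame and leaves the original untouched
df_small = df.drop(columns=["Feet"])

print(df_small.columns.tolist())
```

Assigning the result to a new name keeps the original DataFrame available, which can be convenient while exploring.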
  • It is in this step that you would perform any data cleaning required: this broad term might include deciding how to handle NaN values, filtering out any rows with other corrupted data, or applying some transformations.

Step 5: Selecting slices of data

  • Select all rows with a particular categorical variable

# Your Step 5 Code 
  • In the hills data set example, we will select all hills from Scotland

# Your Step 5 Code 
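Selecting rows by a single category can be sketched with boolean indexing, as below. The label `"Scotland"` is an assumption; the real file may encode countries differently, so inspect the unique values of the column first.

```python
import pandas as pd

# Toy data with made-up values; column names and labels are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D"],
    "Country": ["Scotland", "England", "Wales", "Scotland"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0],
})

# The comparison produces a boolean Series, one value per row
is_scottish = df["Country"] == "Scotland"

# Indexing with that Series keeps only the rows where it is True
scottish_hills = df[is_scottish]
print(scottish_hills)
```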
  • Now select based on two categorical variables, and create a statistic

  • In the hills data set example, answer the question: "What is the median height of hills that are in neither Scotland nor Wales?"

# Your Step 5 Code 
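One way to combine two conditions and compute a statistic is sketched below, on invented data with assumed column names and country labels. It shows two equivalent masks: one built with `&`, and one built with `isin()` and negated with `~`.

```python
import pandas as pd

# Toy data with made-up values; column names and labels are assumptions
df = pd.DataFrame({
    "Country": ["Scotland", "England", "Wales", "Scotland", "England", "Ireland"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0, 700.0, 1000.0],
})

# Option 1: combine two comparisons with & (each needs its own parentheses)
mask_and = (df["Country"] != "Scotland") & (df["Country"] != "Wales")

# Option 2: test membership with isin() and invert the result with ~
mask_isin = ~df["Country"].isin(["Scotland", "Wales"])

# Select the height column for the matching rows and take the median
median_height = df.loc[mask_isin, "Metres"].median()
print(median_height)
```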
  • Let’s answer another question:

  • "Which country has the highest mean hill height?"

# Your Step 5 Code 
  • However, we can do this in a better way, using the Pandas built-in functions

# Your Step 5 Code 
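The two approaches to this question can be compared in a short sketch: an explicit loop over countries, followed by `groupby()`. The data and column names are invented for illustration.

```python
import pandas as pd

# Toy data with made-up values; column names and labels are assumptions
df = pd.DataFrame({
    "Country": ["Scotland", "England", "Wales", "Scotland", "England", "Ireland"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0, 700.0, 1000.0],
})

# Manual approach: filter once per country and store each mean
means = {}
for country in df["Country"].unique():
    means[country] = df.loc[df["Country"] == country, "Metres"].mean()
highest_manual = max(means, key=means.get)

# Built-in approach: groupby() splits the rows by country in one step
mean_by_country = df.groupby("Country")["Metres"].mean()

# idxmax() returns the index label (here, the country) of the largest value
highest = mean_by_country.idxmax()
print(mean_by_country)
print(highest)
```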
  • A final question:

  • "What percentage of hills in the data set are above 1000 metres in height?"

  • "What are the names of the tallest 5 hills?"

# Your Step 5 Code 
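Both questions can be answered with built-in methods, as in the sketch below. It relies on the fact that the mean of a boolean Series is the fraction of `True` values, and uses `nlargest()` for the ranking. Names, heights and column names are all invented.

```python
import pandas as pd

# Toy data with made-up values; the column names are assumptions
df = pd.DataFrame({
    "Name": ["Hill A", "Hill B", "Hill C", "Hill D",
             "Hill E", "Hill F", "Hill G", "Hill H"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0, 700.0, 1000.0, 1300.0, 800.0],
})

# True counts as 1 and False as 0, so the mean is the fraction above 1000 m
pct_above_1000 = (df["Metres"] > 1000).mean() * 100

# nlargest() returns the rows with the largest values, in descending order
tallest_five = df.nlargest(5, "Metres")["Name"].tolist()

print(pct_above_1000)
print(tallest_five)
```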
  • Most of the difficulty is knowing which Pandas functions to use! With small data sets, aspects like computational speed do not matter as much as with large data sets. For larger data sets, using Pandas built-in methods as much as possible is usually an easy way to keep your code fast.

Step 6: Plotting with Pandas and Matplotlib

  • Use the built-in Pandas plotting functions to plot an aspect of your data.

  • Pandas plotting functions are based on the Matplotlib API. These are mostly a 1:1 mapping, but there are some differences.

  • In general, Pandas plotting functions can be useful for quickly creating plots. For detailed customisation, plotting with Matplotlib directly might save time.

Tips:

  • First install Matplotlib in your virtual environment: Pandas needs access to this.

  • In the hills example, let’s plot the number of hills in our data set.

# Your Step 6 Code 
  • In the hills example, let’s plot the number of hills above or equal to a threshold height.

# Your Step 6 Code 
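The two counting plots could be drawn as in the sketch below. It interprets "number of hills" as a count per country, which is an assumption, and the threshold of 1000 is arbitrary. The data is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with made-up values; column names and labels are assumptions
df = pd.DataFrame({
    "Country": ["Scotland", "England", "Wales", "Scotland", "England", "Ireland"],
    "Metres": [1200.0, 900.0, 1050.0, 1100.0, 700.0, 1000.0],
})

# Two side-by-side axes so the plots do not overwrite each other
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Count the hills in each country and draw the counts as bars
counts = df["Country"].value_counts()
counts.plot(kind="bar", ax=ax1, title="All hills")
ax1.set_ylabel("Number of hills")

# Filter by height first, then count again
threshold = 1000  # arbitrary example threshold, in metres
counts_above = df.loc[df["Metres"] >= threshold, "Country"].value_counts()
counts_above.plot(kind="bar", ax=ax2, title=f"Hills at or above {threshold} m")
```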
  • In the hills example, we have some latitude and longitude data. Let’s plot this using a scatter plot.

# Your Step 6 Code 
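A basic scatter plot with the Pandas plotting interface might look like this sketch. The column names `Latitude` and `Longitude` are assumptions, and the coordinates are invented.

```python
import pandas as pd

# Toy data with made-up coordinates; the column names are assumptions
df = pd.DataFrame({
    "Latitude": [56.8, 54.5, 53.1, 57.1],
    "Longitude": [-5.0, -3.2, -4.1, -3.7],
})

# Longitude on the x axis and latitude on the y axis gives a map-like layout
ax = df.plot.scatter(x="Longitude", y="Latitude", s=10)
```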
  • In the hills example, let’s colour the points by country.

# Your Step 6 Code 
  • To add a legend, things are getting sufficiently complicated that moving to a more verbose Matplotlib plotting structure is helpful.

  • In the hills example, we will plot in a loop.

# Your Step 6 Code
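Plotting in a loop with Matplotlib directly can be sketched as follows: each country is drawn with its own `scatter()` call and label, so the legend has one entry per country. Data, column names and country labels are all invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with made-up values; column names and labels are assumptions
df = pd.DataFrame({
    "Country": ["Scotland", "England", "Wales", "Scotland", "England"],
    "Latitude": [56.8, 54.5, 53.1, 57.1, 54.6],
    "Longitude": [-5.0, -3.2, -4.1, -3.7, -3.0],
})

fig, ax = plt.subplots()

# groupby() yields one (label, sub-DataFrame) pair per country
for country, group in df.groupby("Country"):
    # Each call uses the next colour in the default cycle
    ax.scatter(group["Longitude"], group["Latitude"], label=country, s=10)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(title="Country")
```

Passing `label=` in each call is what lets `ax.legend()` build its entries automatically.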