STAT 19000: Project 2 — Spring 2022

Motivation: In Python it is very important to understand some of the data types in a little bit more depth than you would in R. Many of the data types in Python will seem very familiar. A character in R is similar to a str in Python. An integer in R is an int in Python. A float in R is similar to a float in Python. A logical in R is similar to a bool in Python. In addition to all of that, there are some very popular classes that are introduced in packages like numpy and pandas. On the other hand, there are some data types in Python like tuples, lists, sets, and dicts that diverge from R a little bit more. It is integral to understand some of these before jumping too far into everything.

Context: This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.

Scope: dicts, sets, pandas, matplotlib

Learning Objectives
  • Demonstrate the ability to read and write data of various formats using various packages.

  • Explain what is a dict is and why it is useful.

  • Understand how a set works and when it could be useful.

  • List the differences between lists & tuples and when to use each.

  • Gain familiarity with dict methods.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/noaa/2020_sample.csv

Questions

Question 1

In the previous project, we started to get a feel for how lists and tuples work. As a part of this, we had you use the csv package to read in and process data. While this can certainly be useful, and is an efficient way to handle large amounts of data, it takes a lot of work to get the data in a format where you can use it.

As teased in the previous project, Python has a very popular package called pandas that is popular to use for many data-related tasks. If you need to understand 1 thing about pandas it would be that it provides 2 key data types that you can take advantage of: the Series and the DataFrame types. Each of those objects have a ton of built in attributes and methods. We will talk about this more in the future, but you can think of an attribute as a piece of data within the object. You can think of a method as a function closely associated with the object or class. Just know that the attributes and methods provide lots of powerful features!

Please read the fantastic and quick 10 minute introduction to pandas here. We will be slowly introducing bits and pieces of this package throughout the semester. In addition, we will also start incorporating some plotting questions throughout the semester.

Read in the dataset: /depot/datamine/data/noaa/2020_sample.csv using the pandas package, and store it in a variable called df.

Our dataset doesn’t have column headers, but headers are useful. Use the names argument to the read_csv method to give the dataframe a column header.

import pandas as pd

df = pd.read_csv('/depot/datamine/data/noaa/2020_sample.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])

Remember in the previous project how we had you print the first 10 values of a certain column? This time, use the head method to print the first 10 rows of data from our dataset. Do you think it was easier or harder than doing something similar using the csv package?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

View solution

Unresolved include directive in modules/ROOT/pages/spring2022/19000/19000-s2022-project02.adoc - include::book:projects:example$19000-s2022-project02-q02-sol.adoc[]

Question 2

Imagine going back and using the csv package to first count the number of rows of data, and then count the number of columns. Seems like a lot of work for just getting a little bit of information about your data, right? Using pandas this is much easier.

Use one of the attributes from your DataFrame in combination with f-strings to print the following:

There are 123 columns in the DataFrame!
There are 321 rows in the DataFrame!

I’m not asking you to literally print the numbers 123 and 321 — replace those numbers with the actual values.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get method, the other is to use square brackets and strings. Test out the following to understand the differences between the two.

my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}

# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")

# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist

Under the hood, a dict is essentially a data structure called a hash table. Hash tables are a data structure with a useful set of properties. The time needed for searching, inserting, or removing a piece of data has a constant average lookup time. This means that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will usually take about the same amount of time. (The worst case time increases linearly.) Dictionaries (dict) are used a lot, so it is worthwhile to understand them.

Dicts can also be useful to solve small tasks here and there. For example, what if we wanted to figure out how many times each of the unique station_id value appears? Dicts are a great way to solve this! Use the provided code to extract a list of station_id values from our DataFrame. Use the resulting list, a dict, and a loop to figure this out.

import pandas as pd

station_ids = df["station_id"].dropna().tolist()

You should get the following results.

Results
print(my_dict['US1MANF0058']) # 378
print(my_dict['USW00023081']) # 1290
print(my_dict['US10sali004']) # 13

If you get a KeyError — don’t forget — you need to initialize the values for each key in the dict to 0 first. To get a unique list of station_id values, we can use the following code.

unique_ids = list(set(station_ids))

set is another built-in type in Python. A useful use of set is that it can reduce a list to unique values very efficiently. Here, we get the unique values and then convert the set back to a list.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Sets are very useful! I’ve created a nearly identical copy of our dataset here: /depot/datamine/data/noaa/2020_sampleB.csv. The "sampleB" dataset has one key difference — I’ve snuck in a fake row of data! There is 1 row in the new dataset that is not in the old — it can be identified by having a station_id that doesn’t exist in the original dataset. Print the "intruder" row of data.

There are 15000000 rows in the data frame. So this method will take too long, because it requires 15000000 times 15000001 comparisons to find the intruder:

import pandas as pd

df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
intruder_ids = df_intruder["station_id"].dropna().tolist()

for i in intruder_ids:
    if i not in station_ids:
        print(i)

It would eventually work, but it will take way too long to finish. Same problem will occur here: The in operator is useful for checking if a value is in a list. It is, however, essentially the same as what we tried above; it will be way too slow.

for ii in intruder_ids:
    found = False
    for i in station_ids:
        if ii == i:
            found = True
    if not found:
        print(ii)

We need to use our set trick from the question 3, so that we can (instead) make 39962 times 39963 comparisons to find the intruder. For example:

import pandas as pd

df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
intruder_ids = df_intruder["station_id"].dropna().tolist()

unique_intruder_ids = list(set(intruder_ids))

for i in unique_intruder_ids:
    if i not in unique_ids:
        print(i)

We can also do this using the in operator, and we will get the same result.

for ii in unique_intruder_ids:
    found = False
    for i in unique_ids:
        if ii == i:
            found = True
    if not found:
        print(ii)

Check out this great article on sets.

Now that you found the station_id of the intruder — you will need to use pandas indexing to print the entire row of data. This documentation should help.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Run the following to see a very simple example of using matplotlib.

import matplotlib.pyplot as plt

# now you can use it, for example
plt.plot([1,2,3,5],[5,6,7,8])
plt.show()
plt.close()

There are a myriad of great examples and tutorials on how to use matplotlib. With that being said, it takes a lot of practice to become comfortable creating graphics.

Read through the provided links and search online. Describe something you would like to plot from our dataset. Use any of the tools you’ve learned about to extract the data you want and create the described plot. Do your best to get creative, but know that expectations are low — this is (potentially) the very first time you are using matplotlib and we are asking you do create something without guidance. Just do the best you can and post questions in Piazza if you get stuck! The "best" plot will get featured when we post solutions after grades are posted.

You could use this as an opportunity to practice with dicts, sets, and lists. You could also try and learn about and use some of the features that we haven’t mentioned yet (maybe something from the 10 minute intro to pandas). Have fun with it!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.