STAT 19000: Project 2 — Spring 2022
Motivation: In Python, it is important to understand some of the data types in a little more depth than you would in R. Many of the data types in Python will seem very familiar. A character in R is similar to a str in Python. An integer in R is an int in Python. A numeric (double) in R is similar to a float in Python. A logical in R is similar to a bool in Python. In addition, there are some very popular classes that are introduced in packages like numpy and pandas. On the other hand, there are some data types in Python, like tuples, lists, sets, and dicts, that diverge from R a little more. It is essential to understand some of these before jumping too far into everything.
Context: This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.
Scope: dicts, sets, pandas, matplotlib
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/noaa/2020_sample.csv
Questions
Question 1
In the previous project, we started to get a feel for how lists and tuples work. As part of this, we had you use the csv package to read in and process data. While this can certainly be useful, and is an efficient way to handle large amounts of data, it takes a lot of work to get the data into a format where you can use it.
As teased in the previous project, Python has a very popular package called pandas that is used for many data-related tasks. If you need to understand one thing about pandas, it would be that it provides two key data types that you can take advantage of: the Series and the DataFrame types. Each of those objects has a ton of built-in attributes and methods. We will talk about this more in the future, but you can think of an attribute as a piece of data within the object. You can think of a method as a function closely associated with the object or class. Just know that the attributes and methods provide lots of powerful features!
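To make the attribute/method distinction concrete, here is a minimal sketch using a small, made-up Series (the name `temps` and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# A small, made-up Series; the values are illustrative only.
temps = pd.Series([21.5, 19.0, 23.5, 20.0], name="temperature")

# Attributes are accessed without parentheses: they are data stored on the object.
n_values = temps.size      # number of elements
value_type = temps.dtype   # type of the stored values

# Methods are called with parentheses: they are functions tied to the object.
average = temps.mean()     # computes the mean of the values
first_two = temps.head(2)  # returns a new Series with the first 2 values

print(n_values, value_type, average)
print(first_two)
```

A quick rule of thumb: if you need parentheses to use it, it is a method; if not, it is an attribute.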
Please read the fantastic and quick 10 minute introduction to pandas here. We will be slowly introducing bits and pieces of this package throughout the semester. In addition, we will also start incorporating some plotting questions throughout the semester.
Read in the dataset: /depot/datamine/data/noaa/2020_sample.csv using the pandas package, and store it in a variable called df.
Our dataset doesn’t have column headers, but headers are useful. Use the
Remember in the previous project how we had you print the first 10 values of a certain column? This time, use the head method to print the first 10 rows of data from our dataset. Do you think it was easier or harder than doing something similar using the csv package?
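As a rough sketch of the general pattern (not the solution for this dataset), the example below reads header-less CSV text with pandas and supplies column labels. The data rows are made up, and every column name other than station_id is a hypothetical placeholder; an in-memory `io.StringIO` stands in for the real file path.

```python
import io
import pandas as pd

# Made-up, header-less CSV text standing in for a real file on disk.
raw = io.StringIO(
    "US1MANF0058,20200101,PRCP,0\n"
    "USW00023081,20200101,TMAX,150\n"
    "USW00023081,20200102,TMIN,-20\n"
)

# header=None says the first line is data, not column names;
# names= supplies the (hypothetical) column labels.
toy_df = pd.read_csv(
    raw,
    header=None,
    names=["station_id", "date", "element", "value"],
)

# head(n) returns the first n rows as a new DataFrame.
print(toy_df.head(2))
```

With a real file, you would pass the path as the first argument to `pd.read_csv` instead of the `StringIO` object.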
- Code used to solve this problem.
- Output from running the code.
Question 2
Imagine going back and using the csv package to first count the number of rows of data, and then count the number of columns. Seems like a lot of work for just getting a little bit of information about your data, right? Using pandas, this is much easier.
Use one of the attributes from your DataFrame in combination with f-strings to print the following:
There are 123 columns in the DataFrame!
There are 321 rows in the DataFrame!
I’m not asking you to literally print the numbers 123 and 321; replace those numbers with the actual values.
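For reference, here is a small sketch of how an attribute can be combined with an f-string. It uses a made-up DataFrame (`toy_df`) and a different message than the one requested above; the `shape` attribute is a tuple of (rows, columns).

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 2 columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_rows, n_cols = toy_df.shape

# An f-string substitutes the value of any expression placed inside curly braces.
message = f"rows={n_rows}, cols={n_cols}"
print(message)
```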
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get method, the other is to use square brackets and strings. Test out the following to understand the differences between the two.
my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}
# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")
# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Raises a KeyError because the key "height" doesn't exist
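Two related idioms are worth knowing; the sketch below reuses the same example dict. The get method accepts an optional second argument that is returned when the key is missing, and the `in` operator tests for a key without raising an exception. The fallback value "unknown" is just an illustrative choice.

```python
my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}

# get() with a second argument returns that value instead of None for a missing key.
height = my_dict.get("height", "unknown")

# The in operator checks whether a key exists; get() did not add the key.
has_height = "height" in my_dict

# Square brackets raise KeyError for a missing key; try/except lets us handle it.
try:
    my_dict["height"]
except KeyError as err:
    missing_key = err.args[0]  # the key that was not found

print(height, has_height, missing_key)
```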
Under the hood, a dict is essentially a data structure called a hash table. Hash tables have a useful set of properties: searching for, inserting, or removing a piece of data takes constant time on average. This means that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will usually take about the same amount of time. (In the worst case, the time increases linearly with the size of the table.) Dictionaries (dict) are used a lot, so it is worthwhile to understand them.
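To give a feel for the idea, here is a deliberately simplified toy hash table. It is a sketch for illustration only: the class name `ToyHashTable` is hypothetical, it uses a fixed number of buckets with chaining, and it never resizes, whereas Python's real dict is implemented differently and grows as needed.

```python
class ToyHashTable:
    """A tiny hash table with chaining; for illustration only."""

    def __init__(self, n_buckets=8):
        # Each bucket is a list of (key, value) pairs.
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; the remainder picks a bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only one bucket is scanned, not the whole table.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


table = ToyHashTable()
table.set("person", "John")
table.set("person", "Jane")  # replaces the earlier value
table.set("height", 180)
print(table.get("person"), table.get("height"), table.get("weight"))
```

Because a lookup jumps straight to one bucket, the work does not depend on how many keys are stored overall, as long as the buckets stay short. If many keys land in the same bucket, the scan of that bucket is what produces the linear worst case.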
Dicts can also be useful to solve small tasks here and there. For example, what if we wanted to figure out how many times each unique station_id value appears? Dicts are a great way to solve this! Use the provided code to extract a list of station_id values from our DataFrame. Use the resulting list, a dict, and a loop to figure this out.
import pandas as pd
station_ids = df["station_id"].dropna().tolist()
You should get the following results.
print(my_dict['US1MANF0058']) # 378
print(my_dict['USW00023081']) # 1290
print(my_dict['US10sali004']) # 13
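The general counting pattern can be sketched on a small, made-up list (the names `toy_ids` and `counts` and the values are assumptions for illustration; they are not taken from the dataset):

```python
# A made-up list of identifiers standing in for the real station_id values.
toy_ids = ["A", "B", "A", "C", "A", "B"]

counts = {}
for station_id in toy_ids:
    # get() returns the current count, or 0 the first time a value is seen.
    counts[station_id] = counts.get(station_id, 0) + 1

print(counts)
```

The standard library's `collections.Counter` produces the same tallies in one call, but writing the loop by hand is a good way to practice with dicts.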
If you get a
- Code used to solve this problem.
- Output from running the code.
Question 4
Sets are very useful! I’ve created a nearly identical copy of our dataset here: /depot/datamine/data/noaa/2020_sampleB.csv. The "sampleB" dataset has one key difference: I’ve snuck in a fake row of data! There is 1 row in the new dataset that is not in the old; it can be identified by having a station_id that doesn’t exist in the original dataset. Print the "intruder" row of data.
There are 15000000 rows in the DataFrame, so this method will take too long, because it requires 15000000 times 15000001 comparisons to find the intruder:
It would eventually work, but it would take far too long to finish. The same problem will occur here:
The
We need to use our
We can also do this using the
Check out this great article on sets.
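As a hedged sketch of how sets help with this kind of task, the example below uses two tiny, made-up DataFrames (`original` and `modified` are hypothetical stand-ins for the two datasets). Set difference finds the values present in one collection but not the other, and membership checks on a set take constant time on average, which avoids comparing every row against every other row.

```python
import pandas as pd

# Made-up stand-ins for the original dataset and the copy with one extra row.
original = pd.DataFrame({"station_id": ["A", "B", "C", "A"], "value": [1, 2, 3, 4]})
modified = pd.DataFrame({"station_id": ["A", "B", "X", "C", "A"], "value": [1, 2, 99, 3, 4]})

# Converting a column to a set keeps only the unique values.
# Subtracting one set from another leaves the values found only in the first.
new_ids = set(modified["station_id"]) - set(original["station_id"])

# isin() builds a True/False mask that selects the rows whose id is in new_ids.
intruder_rows = modified[modified["station_id"].isin(new_ids)]

print(new_ids)
print(intruder_rows)
```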
Now that you found the |
- Code used to solve this problem.
- Output from running the code.
Question 5
Run the following to see a very simple example of using matplotlib.
import matplotlib.pyplot as plt
# now you can use it, for example
plt.plot([1,2,3,5],[5,6,7,8])
plt.show()
plt.close()
There are a myriad of great examples and tutorials on how to use matplotlib. With that being said, it takes a lot of practice to become comfortable creating graphics.
Read through the provided links and search online. Describe something you would like to plot from our dataset. Use any of the tools you’ve learned about to extract the data you want and create the described plot. Do your best to get creative, but know that expectations are low: this is (potentially) the very first time you are using matplotlib, and we are asking you to create something without guidance. Just do the best you can and post questions in Piazza if you get stuck! The "best" plot will get featured when we post solutions after grades are posted.
You could use this as an opportunity to practice with dicts, sets, and lists. You could also try to learn about and use some of the features that we haven’t mentioned yet (maybe something from the 10 minute intro to pandas). Have fun with it!
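If you are not sure where to start, one possible direction is to turn a dict of counts into a bar chart. The sketch below uses made-up counts (the `counts` dict, labels, and title are all placeholders) and is only one of many reasonable approaches.

```python
import matplotlib.pyplot as plt

# Made-up counts; in practice these could come from a dict you built yourself.
counts = {"A": 3, "B": 2, "C": 1}

# subplots() returns a figure and an axes object to draw on.
fig, ax = plt.subplots()

# One bar per key, with the bar height given by the matching value.
ax.bar(list(counts.keys()), list(counts.values()))

# Labels and a title make the plot much easier to read.
ax.set_xlabel("station_id")
ax.set_ylabel("number of rows")
ax.set_title("Rows per station (toy data)")
```

In a notebook, the figure is normally displayed when the cell finishes; in a script, call plt.show() to display it and plt.close() when you are done, as in the example above.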
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.