TDM 10100: R Project 1 — 2024
Motivation: The goal of this project is to get you comfortable with the basics of operating in Jupyter notebooks as hosted on Anvil, our computing cluster. If you don’t understand the code you’re running/writing at this point in the course, that’s okay! We are going to go into detail about how everything works in future projects.
Context: You to not need any background to get started in The Data Mine. If you never programmed before, that is totally OK! If you never worked with data before, that is totally OK too! We will spend most of our time learning how to work with data, in a practical way.
Scope: Anvil, Jupyter Labs, Jupyter Notebooks, R, bash
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/flights/subset/airports.csv
-
/anvil/projects/tdm/data/icecream/breyers/reviews.csv
-
/anvil/projects/tdm/data/icecream/bj/products.csv
Questions
Question 1 (2 pts)
First and foremost, welcome to The Data Mine! We hope that throughout your journey with us, you learn a lot, make new friends, and develop skills that will help you with your future career. Throughout your time with The Data Mine, you will have plenty of resources available should you need help. By coming to weekly seminar, posting on the class Piazza page, and joining Dr. Ward and the TA team’s office hours, you can ensure that you always have the support you need to succeed in this course.
The links to Piazza are:
Dr Ward is also available on Monday mornings in the Hillenbrand dining court from 8:30 AM to 11:20 AM. He is also available on Monday afternoons during 4:30 PM to 5:20 PM on Zoom at https://purdue-edu.zoom.us/my/mdward/
If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/book/setup If you are still having issues with your ACCESS ID, please send an email containing as much information as possible about your issue to datamine-help@purdue.edu |
Let’s start off by starting up our first Jupyter session on Anvil! We always use the URL https://ondemand.anvil.rcac.purdue.edu and the ACCESS username that you were assigned (when you setup your account) and the ACCESS password that you chose. These are NOT the same as your Purdue account!
These credentials are not the same as your Purdue account. |
In the upper-middle part of your screen, you should see a dropdown button labeled The Data Mine
. Click on it, then select Jupyter Notebook
from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this:
If your screen doesn’t look like this, please try and select the correct dropdown option again or ask one of the TAs during seminar for more assistance.
Jupyter Lab and Jupyter Notebook are technically different things, but in the context of seminar we will refer to them interchangeably to represent the tools you’ll be using to work on your projects. |
There are a few key parts of this screen to note:
-
Allocation: this should always be cis220051 for The Data Mine
-
Queue: again, this should stay on the default option
shared
unless otherwise noted. -
Time in Hours: The amount of time your Jupyter session will last. When this runs out, you’ll need to start a new session. It may be tempting to set it to the maximum, but our computing cluster is a shared resource. This means every hour you use is an hour someone else can’t use, so please ONLY reserve it for 1-2 hours at a time.
-
CPU cores: CPU cores do the computation for the programs we run. It may be tempting to request a large number of CPU cores and set it to the maximum, but our computing cluster is a shared resource. This means every computational core that you use is one that someone else can’t use. We only have a limited number of cores assigned to our team, so please ONLY reserve 1 or 2 cores at a time, unless the project needs more cores.
The Jupyter Lab environment will save your work at regular intervals, so that at the end of a session, your work should be automatically saved. Nonetheless, you can select File from the menu and Save Notebook any time that you want. (It is not necessarily, because the notebooks save automatically, but you can still save anytime if you want to.) |
With the key parts of this screen explained, go ahead and select 1 hour of time with 1 CPU cores and click Launch! After a bit of waiting, you should see something like below. Click connect to Jupyter and proceed to the next question!
You likely noticed a short wait before your Jupyter session launched. This happens while Anvil finds and allocates space for you to work. The more students are working on Anvil, the longer this will take, so it is our suggesting to start your projects early during the week to avoid any last-minute hiccups causing a missed deadline. Please do not wait until Fridays to complete and submit your work! |
Download the project template, as described here: https://the-examples-book.com/projects/templates
We give some information about kernels here: the-examples-book.com/projects/kernels
For the first question in this project, let’s try the first example from the kernel page: We will load the airports data set in Python and will display the head of the airports data set. (Most of our focus will be on R this semester, but we wanted to demonstrate one example in Python.)
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
myDF.head()
Just try this Python code using the seminar
kernel (not the seminar-r
kernel) and make sure that you can see the first five rows of the airports data frame.
-
Use Python to show the output with the first five rows of the airports data frame.
-
Be sure to document your work from Question 1, using some comments and insights about your work.
Question 2 (2 pts)
As you continue to get comfortable with Jupyter Lab, you might want to read more about Jupyter Lab (this is optional). We want you to get comfortable with switching kernels in Jupyter Lab when needed. The different options that you see (like the seminar
kernel and the seminar-r
kernel) in the upper right hand of the screen are called kernels (please read the kernel documentation; this is the same as the documentation from Question 1).
When you first open the template, you may get a pop-up asking you to select what kernel you’ll be using. Select the seminar
kernel (not the seminar-r
kernel). If you do not get this pop-up, you can also select a kernel by clicking on the upper right part of your screen that likely says something similar to No Kernel
, and then selecting the kernel you want to use.
Use the seminar
kernel with R, and %%R
cell magic, to (again) display the first six lines of the airports data frame, but this time in R:
%%R
myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
head(myDF)
Now do this again, using the seminar-r
kernel with R, and notice that you do NOT need the %%R
cell magic with the seminar-r
kernel. You can do all of this in the same Jupyter Lab notebook, just by changing the kernel.
myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
head(myDF)
You have now loaded the first six lines of the airports data frame in three ways (once in Question 1, and now a second and a third time in Question 2).
A Jupyter notebook is made up of cells
, which you can edit and then run
. There are two types of cells we’ll work in for this class:
-
Markdown cells. These are where your writing, titles, sections, and paragraphs will go. Double clicking a markdown cell puts it in
edit
mode, and then clicking the play button near the top of the screen runs the cell, which puts it in its formatted form. More on this in a second. For now, just recognize that most markdown looks like regular text with extra characters like#
,*
, and-
to specify bolding, indentation font, size, and more! -
Code cells. These are where you will write and run all your code! Clicking the play button will run the code in that cell, and the programming language is specified by the language or languages known by the kernel that you chose.
For each question in The Data Mine, please always be sure to put some comments after your cells, which describe all of the work that you are doing in the cells, as well as your thinking and insights about the results.
Some common Jupyter notebooks shortcuts:
|
-
Use R to show the output with the first six rows of the airports data frame, and do this two ways: once using R with the
seminar
kernel, and once using R with theseminar-r
kernel. -
Be sure to document your work from Question 2, using some comments and insights about your work.
Question 3 (2 pts)
Which state has the largest number of airports? How many airports are located in that state? We can refer to one column of a data set by using the dollar sign and the name of the column. For instance, in the airports data set, the state where the airport is located is found in the column called myDF$state
. The table
function and the sort
function can be helpful, for answering this question.
We can get the counts of airports in each state:
table(myDF$state)
and we can put the results into numerical order:
sort(table(myDF$state))
-
Use R to show how many airports are found in each state, first in alphabetical order (which is the default), and then in sorted order. You are welcome to work in R and use the
seminar-r
kernel. -
Be sure to document your work from Question 3, using some comments and insights about your work.
Question 4 (2 pts)
In the ice cream products data set:
/anvil/projects/tdm/data/icecream/combined/products.csv
each product is represented just one time. How many times does each brand occur in the products.csv
data set?
In the ice cream reviews data set, on the other hand, there are thousands of reviews of each product:
/anvil/projects/tdm/data/icecream/combined/reviews.csv
How many times does each brand occur in the reviews.csv
data set?
You work will be similar to the work from Question 3. Be sure to document each question with comments about your work.
-
Use R to show how times each brand of ice cream appears, in each of the two files indicated above.
-
Be sure to document your work from Question 4, using some comments and insights about your work.
Question 5 (2 pts)
Using the plot
command to display the number of times that each brand occurs in the ice cream products.csv
data set from Question 4.
Then make a second plot
that displays the number of reviews for each brand in the ice cream reviews.csv
data set.
As always, be sure to document your work.
-
Use R to make two plots, illustrating how many times that each brand of ice cream appears in the two (respective) data sets with ice cream data.
-
Be sure to document your work from Question 5, using some comments and insights about your work.
Submitting your Work
Congratulations! Assuming you’ve completed all the above questions, you’ve just finished your first project for TDM 10100! If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points. At the bottom of each 101 project, you will find a comprehensive list of all the files that need to be submitted for that project. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!
-
firstname_lastname_project1.ipynb
You must double check your You will not receive full credit if your |