TDM 20100: Project 2 — Working with the bash shell
Motivation: In the previous project we became (re-)familiarized with working on Anvil, before diving straight into using the bash shell (the command line interface). By learning to create, destroy, and move files and directories, along with some basic commands to begin to analyze files, we will be well on our way to performing some primitive forms of data analysis, using nothing but the terminal!
Context: The ability to use bash shell commands such as `cat`, `cd`, `du`, `ls`, `mv`, `pwd`, `rm`, `sort`, `uniq`, and `wc`, along with a basic understanding of the `man` (manual) pages, will enable you to see some of the power and speed of using the bash shell.
Scope: Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem manipulation
Dataset(s)
This project will use the following dataset(s):

- `/anvil/projects/tdm/data/flights/subset/` (airplane data)
- `/anvil/projects/tdm/data/election` (election data)
- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data)
Questions
Question 1 (2 pts)
In your `$HOME` directory, you can store only 25GB of data, but in your `$SCRATCH` directory, you can store up to 200TB of data.

Your `$SCRATCH` directory is not intended for long-term storage, and it can be erased by the system administrators at regular points in time. Nonetheless, it can be very helpful for working on data sets that do not need to be stored for a long time.

Your project templates (and all of your Jupyter Lab files) should be stored in your `$HOME` directory, but it is OK to put temporary data files into your `$SCRATCH` directory.
Make a new file called `myflights.csv` in the `$SCRATCH` directory that contains only the first line of the `1987.csv` file.

Now take all of the csv data files `1987.csv` through `2008.csv` from the `/anvil/projects/tdm/data/flights/subset/` directory and add their rows of data, one at a time, to the `myflights.csv` file. Be sure not to add the headers of these files. To accomplish this, use the `grep` command with the `-h` and `-v` options. (The `-h` option hides the name of the file in the results, and the `-v` option skips any lines of the files that contain the word "Year".) To append data to the end of a file, use `>>`.
(In contrast, the pipe symbol `|` sends the output of one command to the input of another command, rather than to a file.)
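The header-then-append pattern described above can be sketched in a `%%bash` cell. The example below uses tiny made-up files in `/tmp` instead of the real flight data, so that the idea is visible at a glance; substitute the real paths when you do the question.

```shell
# Toy demonstration of the header-then-append pattern
# (made-up mini files in /tmp, NOT the real flight data):
mkdir -p /tmp/flightdemo && cd /tmp/flightdemo
printf 'Year,Month\n1987,1\n1987,2\n' > 1987.csv
printf 'Year,Month\n1988,3\n' > 1988.csv

head -n 1 1987.csv > myflights.csv             # start with a single header line
for f in 1987.csv 1988.csv; do
    grep -h -v "Year" "$f" >> myflights.csv    # -h hides filenames, -v skips header lines
done

wc -l myflights.csv                            # 1 header + 3 data rows = 4 lines
```

Note that `>` creates (or overwrites) the file, while `>>` appends to it, which is why the header goes in first with `>` and every later file is added with `>>`.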
Now check that the resulting file has the correct number of lines.

The original files `1987.csv` through `2008.csv` have a total of 118914480 lines. The file `myflights.csv` has all of these lines, except for the 22 header lines from the 22 respective files, plus it has the header from the `1987.csv` file. So it should have a total of 118914480 - 22 + 1 = 118914459 lines.
Note: `wc`, which stands for word count, is actually capable of much more than simply counting the words in a file! Take a look at the examples below, along with its man page, for some ideas about the power of `wc`. The `wc` command gives the number of lines, words, bytes, or characters within a file.
%%bash
# prints line count, then word count, then byte count for `2012.csv`
wc /anvil/projects/tdm/data/stackoverflow/processed/2012.csv
# prints just the line count for `2012.csv`
wc -l /anvil/projects/tdm/data/stackoverflow/processed/2012.csv
# prints just the word count for `2012.csv`
wc -w /anvil/projects/tdm/data/stackoverflow/processed/2012.csv
# prints just the byte count for `2012.csv`
wc -c /anvil/projects/tdm/data/stackoverflow/processed/2012.csv
Another note: The `du` command (which stands for disk usage) measures the total disk space occupied by files and directories. Again, review the man page for `du` and the examples below, and then move on to the final set of tasks for this project.
%%bash
# print the number of bytes that all of the processed directory is taking up
du -b /anvil/projects/tdm/data/stackoverflow/processed
# prints the number of kilobytes that the processed directory is taking up
du --block-size=KB /anvil/projects/tdm/data/stackoverflow/processed
# prints the number of kilobytes that each file in the processed directory is taking up
du --block-size=KB -a /anvil/projects/tdm/data/stackoverflow/processed
- Show the output from running `wc $SCRATCH/myflights.csv` (which will demonstrate that you produced a file with 118914459 lines).
- Show the head of the file, namely: `head $SCRATCH/myflights.csv` (which should show the header and the data about 9 flights from 1987).
- As always, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will document every single question with your comments and your insights.
Question 2 (2 pts)
Sometimes we want to copy files directly. Let’s create a new directory in our `$SCRATCH` folder and copy all of those files with flight data (`1987.csv` through `2008.csv`) into that directory. Call the directory `myfolder`. Inside that folder, after those files are copied, build another file (like in Question 1) called `myflightsremix.csv`. Finally, compare these two files, using `cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (if the files are exactly the same, there should be no output, because the files have no differences). Also compare them by running `ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv`, which should demonstrate that they are the same size. Check `wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` to ensure that they have the same number of lines, words, and bytes.

Now go back to the scratch directory and remove this folder and its contents, using `cd $SCRATCH` and then `rm -r $SCRATCH/myfolder`.
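The copy/compare/cleanup workflow can be sketched with a tiny made-up file (again in `/tmp`, not the real flight data), just to show how `cp`, `cmp`, `wc`, and `rm -r` fit together:

```shell
# Toy demonstration of copy, compare, and cleanup
# (made-up mini file, NOT the real flight data):
mkdir -p /tmp/copydemo/myfolder && cd /tmp/copydemo
printf 'Year,Month\n1987,1\n' > 1987.csv

cp 1987.csv myfolder/                        # copy the file into the new directory
head -n 1 myfolder/1987.csv >  myfolder/remix.csv
grep -h -v "Year" myfolder/1987.csv >> myfolder/remix.csv

cmp 1987.csv myfolder/remix.csv              # identical files produce no output at all
wc  1987.csv myfolder/remix.csv              # same line, word, and byte counts

rm -r myfolder                               # remove the folder and everything in it
```

The `-r` (recursive) flag is what lets `rm` descend into the directory and delete its contents; plain `rm` refuses to remove a directory.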
- Show the output of: `cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (which should be empty output, i.e., it should not print anything, because these files should have no differences)
- Show the output of: `ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (which should demonstrate that they are the same size)
- Show the output of: `wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (to ensure that they have the same number of lines, words, and bytes)
- Then throw away the folder `$SCRATCH/myfolder` and finally show `ls -la $SCRATCH` to demonstrate that the folder `$SCRATCH/myfolder` is gone!
Question 3 (2 pts)
Copy the files `itcont1980.txt` through `itcont2024.txt` from the directory `/anvil/projects/tdm/data/election` into your `$SCRATCH` directory. Then create a new directory called `mytemporarydirectory` in your `$SCRATCH` directory and move all of these election files into that new directory. Finally, put the content from all of these election files into a new file called `myelectiondata.txt`. Check the size of this new file using the `wc` command. When you are finished, it is OK to remove the directory `mytemporarydirectory` (and its contents) from the `$SCRATCH` directory.
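The move-and-combine steps can be sketched with two tiny made-up files (the contents below are invented for illustration; the real election files are much larger):

```shell
# Toy demonstration of mkdir, mv, and cat
# (made-up mini files, NOT the real election data):
mkdir -p /tmp/electiondemo && cd /tmp/electiondemo
printf 'donor A\n' > itcont1980.txt
printf 'donor B\n' > itcont1981.txt

mkdir -p mytemporarydirectory
mv itcont*.txt mytemporarydirectory/        # move the files into the new directory
cat mytemporarydirectory/itcont*.txt > mytemporarydirectory/myelectiondata.txt

wc mytemporarydirectory/myelectiondata.txt  # lines, words, bytes of the combined file
```

The `itcont*.txt` glob expands to every matching filename, so one `mv` moves all of the files at once, and one `cat` concatenates all of them in order.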
- Show the output of: `wc mytemporarydirectory/myelectiondata.txt` (which should show that the file has 229169299 lines and 1385963208 words and 42790681570 bytes).
Question 4 (2 pts)
Extract the Origin and Destination columns from all of the files `1987.csv` to `2008.csv` in the directory `/anvil/projects/tdm/data/flights/subset`. Save these origins and destinations into a file called `$SCRATCH/myoriginsanddestinations.txt`. Then sort this data and save the results to `$SCRATCH/mysortedoriginsanddestinations.txt`. Then use the `uniq -c` command to get the counts corresponding to the number of times that each flight path occurred, saved to `$SCRATCH/mycounts.txt`. Note: you need to sort the file before using `uniq -c`. Now sort the file again, this time in numerical order, using `sort -n`, and save the results to `$SCRATCH/mysortedcounts.txt`. Finally, display the `tail` of the file, which contains the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.
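The `sort`, `uniq -c`, `sort -n`, `tail` sequence can be sketched with a made-up three-row file (the column positions and values below are invented for illustration, not taken from the real flight data):

```shell
# Toy demonstration of the sort / uniq -c / sort -n pattern
# (made-up rows; the last two fields stand in for the Origin and Dest columns):
mkdir -p /tmp/pathdemo && cd /tmp/pathdemo
printf 'x,y,IND,ORD\nx,y,IND,ORD\nx,y,LAX,JFK\n' > toy.csv

cut -d, -f3,4 toy.csv > paths.txt      # extract the two columns of interest
sort paths.txt > sortedpaths.txt       # uniq -c only counts adjacent duplicate lines
uniq -c sortedpaths.txt > counts.txt   # prefix each distinct path with its count
sort -n counts.txt > sortedcounts.txt  # numeric sort: most popular paths end up last
tail sortedcounts.txt
```

Sorting first is essential because `uniq` only collapses repeated lines that are adjacent; the second, numeric sort orders the paths by their counts rather than alphabetically.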
- Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.
Question 5 (2 pts)
Use the `cut` command with the flags `-d, -f7` to extract the `STORE_R` values from this file: `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv`

Then use the techniques that you learned in Question 4 to discover how many times each of the `STORE_R` values appears in the file.

- List the number of times that each of the `STORE_R` values appears in the file.
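The `cut` flags can be sketched on a made-up file (the rows and region names below are invented; in particular, field 7 here merely stands in for the `STORE_R` column of the real transactions file):

```shell
# Toy demonstration of cut -d, -f7 plus the counting pattern from Question 4
# (made-up rows, NOT the real transactions data):
mkdir -p /tmp/storedemo && cd /tmp/storedemo
printf 'a,b,c,d,e,f,EAST\na,b,c,d,e,f,WEST\na,b,c,d,e,f,EAST\n' > toy.csv

cut -d, -f7 toy.csv | sort | uniq -c | sort -n   # count of each field-7 value
```

Here `-d,` sets the field delimiter to a comma and `-f7` selects the seventh field; piping through `sort | uniq -c | sort -n` then tallies the values just as in Question 4. (On the real file, remember that the `STORE_R` header itself will also appear once in the output.)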
Submitting your Work
Congratulations! With this project complete, you’re now familiar with many of the basic uses of the command line! With these tools in your belt, you can now explore, analyze, and manipulate a large part of Anvil at your whims! Please don’t use your newfound powers for evil!
In the next project, we’ll be building on these more primal analysis tools by introducing some more complex commands that allow us to perform specific search-and-return processes on data. From there, the sky is the limit, and we will be ready to dive into one of the most useful and important concepts in all of code: pipelines. More to come!
- firstname-lastname-project2.ipynb

You must double check your .ipynb file after submitting it. You will not receive full credit if your .ipynb file does not show all of your work.