STAT 39000: Project 10 — Fall 2020
Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.
Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG
, MIN
, and MAX
.
Scope: SQL, sqlite
Dataset
The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu
.
As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.
Questions
Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. |
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like |
For this project all solutions should be done using R code chunks, and the
|
Question 1
Connect to RStudio Server rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal like we did in the previous project, instead, create a connection to our MariaDB lahman database using the RMariaDB
package in R, and the credentials below. Confirm the connection by running the following code chunk:
con <- dbConnect(RMariaDB::MariaDB(),
host="scholar-db.rcac.purdue.edu",
db="lahmandb",
user="lahman_user",
password="HitAH0merun")
head(dbGetQuery(con, "SHOW tables;"))
In the example provided, the variable |
-
R code used to solve the problem.
-
Output from running your (potentially modified)
head(dbGetQuery(con, "SHOW tables;"))
.
Question 2
How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB
) and hit at least 40 home runs (HR
) in one year.
Use the |
You only need to run |
In our project template, for this project, make all of the SQL queries using the |
You can use
|
We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use |
-
R code used to solve the problem.
-
The result of running the R code.
Question 3
How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?
Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use |
Use the |
-
R code used to solve the problem.
-
The result of running the R code.
Question 4
The Batting Average is a metric for a batter’s performance. The Batting Average in a year is calculated by \$\frac{H}{AB}\$ (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID
, teamID
, and yearID.
Use the |
-
R code used to solve the problem.
-
The result of running the R code.
Question 5
How many unique players have hit > 50 home runs (HR
) in a season?
If you view |
-
R code used to solve the problem.
-
The result of running the R code.
Question 6
Find the number of unique players that attended Purdue University. Start by finding the schoolID
for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.
Use the |
You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates. |
To find IU’s id, try the following query: |
-
R code used to solve the problem.
-
The result of running the R code.
Question 7
Use R, SQL and the lahman database to create an interesting infographic. For those of you who are not baseball fans, try doing a Google image search for "baseball plots" for inspiration. Make sure the plot is polished, has appropriate labels, color, etc.
-
R code used to solve the problem.
-
The result of running the R code.