TDM 20200: Project 8 — 2024
Motivation: Spark uses a distributed computing model to process data, which means that data is processed in parallel across a cluster of machines. PySpark is a Spark API that allows you to interact with Spark through the Python shell, in Jupyter Lab, in Databricks, etc. PySpark provides a way to access Spark's framework using Python. It combines the simplicity of Python with the power of Apache Spark.
Context: Understand components of Spark’s ecosystem that PySpark can use
Scope: Python, Spark SQL, Spark Streaming, MLlib, GraphX
Dataset(s)
The following questions will use the following dataset:
- /anvil/projects/tdm/data/whin/weather.parquet
Readings and Resources
Questions
Question 1 (2 points)
- Run the example from the Examples Book - PySpark, first in Pandas and then in Spark. Make sure to show the time used by each of these methods. (One possible timing approach is sketched after this question.)
- Comment on the different speeds needed to process the data using Pandas vs. PySpark.
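If it helps to get started, here is a minimal timing sketch, assuming the weather.parquet dataset and an aggregation over station_id. The actual example to run comes from the Examples Book, so treat the aggregation here only as an illustration of how to time the two approaches.
import time
import pandas as pd
from pyspark.sql import SparkSession

path = "/anvil/projects/tdm/data/whin/weather.parquet"

# Pandas: loads the entire file into memory on one machine, then aggregates.
start = time.time()
pandas_df = pd.read_parquet(path)
print(pandas_df.groupby("station_id").size())
print(f"Pandas took {time.time() - start:.2f} seconds")

# PySpark: builds a lazy plan; the show() action triggers the actual computation.
spark = SparkSession.builder.appName("TDM_S").config("spark.driver.memory", "2g").getOrCreate()
start = time.time()
spark_df = spark.read.parquet(path)
spark_df.groupBy("station_id").count().show()
print(f"PySpark took {time.time() - start:.2f} seconds")
In Jupyter Lab, the %%time cell magic is another easy way to record the elapsed time of each cell.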
Question 2 (2 points)
- Run the following code to initiate a PySpark application. The name of our PySpark session is sp, but you may use a different name. This is the entry point to using Spark's functionality with a DataFrame and with the SQL API.
- Read the file /anvil/projects/tdm/data/whin/weather.parquet into a PySpark DataFrame called myDF.
- Show the first 5 rows of the resulting PySpark DataFrame. (A minimal sketch of the read and display steps appears after the code below.)
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create (or reuse) a SparkSession named TDM_S with 2 GB of driver memory.
sp = SparkSession.builder.appName('TDM_S').config("spark.driver.memory", "2g").getOrCreate()
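For reference, a minimal sketch of the read and display steps, using the sp session created above, might look like this:
# Read the parquet file into a PySpark DataFrame.
myDF = sp.read.parquet("/anvil/projects/tdm/data/whin/weather.parquet")

# show() is an action, so this triggers Spark to actually read the data.
myDF.show(5)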
Question 3 (2 points)
- List the DataFrame's column names and data types.
- How many rows are in the DataFrame?
- How many unique station_id values are in the data set? (One possible approach is sketched below.)
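One possible approach is sketched below, assuming the myDF DataFrame from Question 2; the station_id column name comes from the question itself.
# Column names and data types: dtypes is a list of (name, type) pairs,
# and printSchema() prints the schema as a tree.
print(myDF.dtypes)
myDF.printSchema()

# Total number of rows in the DataFrame.
print(myDF.count())

# Number of unique station_id values.
print(myDF.select("station_id").distinct().count())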
Question 4 (2 points)
- Create a Temporary View called weather from the PySpark DataFrame myDF.
- Run a SQL Query to get the total number of records for each station.
- Run a SQL Query to get the maximum wind speed recorded by each station.
- Run a SQL Query to get the average temperature recorded by each station. (A sketch of the temporary view and one example query appears below.)
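A sketch of the temporary view and one example query follows. The wind speed and temperature column names used here are placeholders (check myDF.printSchema() for the actual names), so this is only an outline of the approach, not the exact queries.
# Register myDF as a temporary view named "weather" so it can be queried with SQL.
myDF.createOrReplaceTempView("weather")

# Total number of records for each station.
sp.sql("""
    SELECT station_id, COUNT(*) AS num_records
    FROM weather
    GROUP BY station_id
""").show()

# The same GROUP BY pattern works for the other two queries; the column names
# wind_speed and temperature below are assumptions, not the dataset's real names.
sp.sql("SELECT station_id, MAX(wind_speed) AS max_wind FROM weather GROUP BY station_id").show()
sp.sql("SELECT station_id, AVG(temperature) AS avg_temp FROM weather GROUP BY station_id").show()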
Question 5 (2 points)
- Explore the DataFrame, and run 2 SQL Queries of your own choosing. Explain the meaning of your two queries and what you learn from the queries. (An example of the kind of exploratory query you might write is sketched below.)
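If you are unsure where to start, here is one example of the kind of exploratory query you might write; the observation_time column name is only a placeholder for whatever timestamp column the dataset actually contains.
# Example exploratory query: number of observations per station per calendar month.
# Replace observation_time with the real timestamp column from myDF.printSchema().
sp.sql("""
    SELECT station_id, MONTH(observation_time) AS month, COUNT(*) AS num_obs
    FROM weather
    GROUP BY station_id, MONTH(observation_time)
    ORDER BY station_id, month
""").show()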
Project 08 Assignment Checklist
- Jupyter Lab notebook with your code, comments and outputs for the assignment
  - firstname-lastname-project08.ipynb
- Python file with code and comments for the assignment
  - firstname-lastname-project08.py
- Submit files through Gradescope
Please make sure to double check that your submission is complete, and that it contains all of your code and output, before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.