STAT 19000: Project 10 — Spring 2021
Motivation: We’ve covered a lot of material in a very short amount of time. At this point in time, you have so many powerful tools at your disposal. Last semester in project 14 we used our new skills to build a beer recommendation system. It is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system!
Context: This is the third project in a series of projects designed to help you learn about the pandas and numpy packages. In this project we build on our previous project to finalize our beer recommendation system.
Scope: python, numpy, pandas
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/beer
Load the following datasets up and assume they are always available:
import pandas as pd
beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet")
breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet")
reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet")
Project 09 Solution
Below is the solution for the previous project, as we’ll be using its methods and don’t want to leave anybody behind:
def prepare_data(myDF, min_num_reviews):
    # remove rows where score is na
    myDF = myDF.loc[myDF.loc[:, "score"].notna(), :]
    # get a list of usernames that have at least min_num_reviews
    usernames = myDF.loc[:, "username"].value_counts() >= min_num_reviews
    usernames = usernames.loc[usernames].index.values.tolist()
    # get a list of beer_ids that have at least min_num_reviews
    beerids = myDF.loc[:, "beer_id"].value_counts() >= min_num_reviews
    beerids = beerids.loc[beerids].index.values.tolist()
    # first remove all rows where the username has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "username"].isin(usernames), :]
    # remove rows where the beer_id has less than min_num_reviews
    myDF = myDF.loc[myDF.loc[:, "beer_id"].isin(beerids), :]
    return myDF
train = prepare_data(reviews, 1000)
def mutate_std_score(data: pd.DataFrame) -> pd.DataFrame:
    """
    mutate_std_score is a function to use in conjunction with
    pd.apply and pd.groupby to create a new column that is
    the standardized score.
    Args:
        data (pd.DataFrame): A pandas DataFrame.
    Returns:
        pd.DataFrame: A modified pandas DataFrame.
    """
    data['standardized_score'] = (data['score'] - data['score'].mean())/data['score'].std()
    return data
train = train.groupby(["username"]).apply(mutate_std_score)
score_matrix = pd.pivot_table(train, values='standardized_score', index='username', columns='beer_id')
print(score_matrix.shape)
score_matrix.head()
score_matrix = score_matrix.fillna(score_matrix.mean(axis=0))
score_matrix.head()
Questions
Question 1
If you struggled with or did not do the previous project, or would like to start fresh, please see the solutions to the previous project (will be posted Saturday morning) and feel free to use them as your own. Cosine similarity is a measure of similarity between two non-zero vectors. It is used in a variety of ways in data science. Here is a pretty good article that tries to give some intuition into it. sklearn provides us with a function that calculates cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity
Use the cosine_similarity function on our score_matrix. The result will be a numpy array. Use the fill_diagonal function from numpy to fill the diagonal with 0. Convert the array back to a pandas DataFrame. Make sure to manually assign the index of the new DataFrame to be equal to score_matrix.index. Lastly, manually assign the columns to be score_matrix.index as well. The end result should be a matrix with usernames on both the x and y axes. Each cell's value represents how "close" one user is to another. Normally the values on the diagonal would be 1 because a user is 100% similar to themselves. To prevent each user from showing up as their own nearest neighbor, we force the diagonal to be 0. Name the final result cosine_similarity_matrix.
- Python code used to solve the problem.
- head of cosine_similarity_matrix.
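To make the idea concrete, here is a minimal sketch on a made-up 3-user matrix (the names toy_scores and toy_similarity and all values are hypothetical, not from the beer data). It computes cosine similarity by hand with numpy, as the dot product of two rows divided by the product of their lengths, which should match what sklearn's cosine_similarity returns for the same input. It then zeroes the diagonal and wraps the result in a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 users (rows) x 3 beers (columns).
toy_scores = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]],
    index=["user_a", "user_b", "user_c"],
    columns=[101, 102, 103],
)

# Cosine similarity: scale each row to unit length, then take all pairwise dot products.
values = toy_scores.to_numpy()
norms = np.linalg.norm(values, axis=1, keepdims=True)
unit_rows = values / norms
sim = unit_rows @ unit_rows.T

# np.fill_diagonal modifies the array in place and returns None.
np.fill_diagonal(sim, 0)

# Label both axes with the usernames.
toy_similarity = pd.DataFrame(sim, index=toy_scores.index, columns=toy_scores.index)
print(toy_similarity)
```

Here user_a and user_b have identical rows, so their similarity is (up to floating point) 1, while user_c shares no rated beers with them and gets 0.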
Question 2
Write a function called get_knn that accepts the cosine_similarity_matrix, a username, and a value, k. The function get_knn should return a pandas Series or list containing the usernames of the k most similar users to the input username.
This may sound difficult, but it is not. It really only involves sorting some values and grabbing the first k.
Test it on the following; we demonstrate the output if you return a list:
k_similar=get_knn(cosine_similarity_matrix,"2GOOFY",4)
print(k_similar) # ['Phil-Fresh', 'mishi_d', 'SlightlyGrey', 'MI_beerdrinker']
- Python code used to solve the problem.
- Output from running your code.
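The "sort and grab the first k" idea can be sketched on a small made-up similarity matrix. The helper name top_k_neighbors and the toy values are hypothetical; this is one possible approach, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical similarity matrix with the diagonal already set to 0.
users = ["user_a", "user_b", "user_c", "user_d"]
toy_similarity = pd.DataFrame(
    [[0.0, 0.9, 0.1, 0.5],
     [0.9, 0.0, 0.3, 0.2],
     [0.1, 0.3, 0.0, 0.7],
     [0.5, 0.2, 0.7, 0.0]],
    index=users,
    columns=users,
)

def top_k_neighbors(similarity, username, k):
    """Return the k usernames with the highest similarity to username."""
    # Pull out the row of similarities for this user.
    row = similarity.loc[username, :]
    # Sort from most to least similar and keep the first k labels.
    return row.sort_values(ascending=False).head(k).index.tolist()

print(top_k_neighbors(toy_similarity, "user_a", 2))  # ['user_b', 'user_d']
```

Because the diagonal is 0 rather than 1, user_a is not returned as their own closest neighbor here.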
Question 3
Let’s test get_knn to see if the results make sense. Pick out a user, and the most similar other user. First, get a DataFrame (let’s call it aux) containing just their reviews. The result should be a DataFrame that looks just like the reviews DataFrame, but just contains your users' reviews.
Next, look at aux. Wouldn’t it be nice to get a DataFrame where the beer_id is the row index, the first column contains the scores for the first user, and the second column contains the scores for the second user? Use the pivot_table method to accomplish this, and save the result as aux.
Lastly, use the dropna method to remove all rows where at least one of the users has an NA value. Sort the values in aux using the sort_values method. Take a look at the result and write 1-2 sentences explaining whether or not you think the users rated the beers similarly.
You could also create a scatter plot using the resulting DataFrame. If it is a good match, the points should fall roughly along a positively sloped line.
- Python code used to solve the problem.
- Output from running your code.
- 1-2 sentences explaining whether or not you think the users rated the beers similarly.
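The reshaping steps can be sketched on a tiny made-up set of reviews. The usernames, beer ids, and scores below are hypothetical; only the column names mirror the reviews DataFrame.

```python
import pandas as pd

# Hypothetical reviews for two users.
toy_reviews = pd.DataFrame({
    "username": ["user_a", "user_a", "user_a", "user_b", "user_b"],
    "beer_id": [101, 102, 103, 101, 103],
    "score": [4.5, 3.0, 2.0, 4.0, 2.5],
})

# One row per beer_id, one column per user; beers a user did not review become NaN.
toy_aux = pd.pivot_table(toy_reviews, values="score", index="beer_id", columns="username")

# Keep only beers both users reviewed, then sort by the first user's scores.
toy_aux = toy_aux.dropna().sort_values(by="user_a", ascending=False)
print(toy_aux)
```

Beer 102 is dropped because user_b never reviewed it. On the remaining rows, both users rank beer 101 above beer 103, which is the kind of agreement to look for in the real data.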
Question 4
We are so close, and things are looking good! The next step for our system is to write a function that finds recommendations for a given user. Write a function called recommend_beers that accepts four arguments: the train DataFrame, a username, a cosine_similarity_matrix, and k (how many neighbors to use). The function recommend_beers should return the top 5 recommendations.
Calculate the recommendations by:
- Finding the k nearest neighbors of the input username.
- Get a DataFrame with all of the reviews from train for every neighbor. Let’s call this aux.
- Get a list of all beer_id that the user with username has reviewed.
- Remove all beers from aux that have already been reviewed by the user with username.
- Group by beer_id and calculate the mean standardized_score.
- Sort the results in descending order, and return the top 5 beer_id values.
Test it on the following:
recommend_beers(train, "22Blue", cosine_similarity_matrix, 30) # [40057, 69522, 22172, 59672, 86487]
- Python code used to solve the problem.
- Output from running your code.
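The filter, group, and sort steps can be sketched on made-up data. The helper name top_unseen_beers and all values are hypothetical, and the list of neighbors is passed in directly here, whereas in the project it would come from get_knn.

```python
import pandas as pd

# Hypothetical standardized reviews; column names mirror the train DataFrame.
toy_train = pd.DataFrame({
    "username": ["me", "me", "n1", "n1", "n1", "n2", "n2", "n2"],
    "beer_id": [1, 2, 1, 3, 4, 3, 4, 5],
    "standardized_score": [0.5, -0.5, 1.0, 1.2, -0.3, 0.8, 0.1, -1.0],
})

def top_unseen_beers(train, username, neighbors, n=5):
    """Rank beers the neighbors reviewed that username has not, by mean standardized score."""
    # All reviews written by the neighbors.
    aux = train.loc[train["username"].isin(neighbors), :]
    # Beers the target user has already reviewed.
    seen = train.loc[train["username"] == username, "beer_id"].tolist()
    # Drop those beers from the candidates.
    aux = aux.loc[~aux["beer_id"].isin(seen), :]
    # Average the neighbors' standardized scores per beer.
    means = aux.groupby("beer_id")["standardized_score"].mean()
    # Highest mean first; return the top n beer ids.
    return means.sort_values(ascending=False).head(n).index.tolist()

print(top_unseen_beers(toy_train, "me", ["n1", "n2"], n=2))  # [3, 4]
```

Beer 1 is excluded because "me" already reviewed it. Beer 3 ranks first with a mean of 1.0, followed by beer 4 with a mean of -0.1.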
Question 5
(optional, 0 pts) Improve our recommendation system! Below are some suggestions, don’t feel limited by them:
- Instead of returning a list of beer_id, return the beer info from the beers dataset.
- Remove all retired beers.
- Somehow add a cool plot.
- Etc.
- Python code used to solve the problem.
- Output from running your code.
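As a starting point for the first two suggestions, here is a rough sketch on a made-up stand-in for the beers dataset. The column names "id", "name", and "retired", and the "t"/"f" coding, are assumptions made for illustration; check beers.columns and the actual values before adapting it.

```python
import pandas as pd

# Hypothetical stand-in for the beers dataset (column names and coding are assumed).
toy_beers = pd.DataFrame({
    "id": [3, 4, 5],
    "name": ["Toy Stout", "Toy Lager", "Toy IPA"],
    "retired": ["f", "t", "f"],
})

# Hypothetical output of a recommendation function, best match first.
recommended_ids = [3, 4, 5]

# Look up beer info for the recommended ids, preserving the recommendation order.
info = toy_beers.set_index("id").loc[recommended_ids, :]

# Keep only beers that are not marked as retired.
info = info.loc[info["retired"] == "f", :]
print(info)
```

Filtering after the lookup can leave fewer than 5 results, so one option is to filter retired beers out before taking the top 5.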