401 Project 2 - Intro to ML - Basic Concepts
Project Objectives
In this project, we will learn how to select an appropriate machine learning model. Understanding the specifics of how the models work may help in this process, but there are other aspects of a problem that we can examine to guide the choice.
Datasets
- /anvil/projects/tdm/data/iris/Iris.csv
- /anvil/projects/tdm/data/boston_housing/boston.csv
- /anvil/projects/tdm/data/forest/REF_SPECIES.csv
The Iris dataset is a classic dataset that is often used to introduce machine learning concepts. You can read more about it here. If you would like more information on the boston dataset, please read here.
Questions
Question 1 (2 points)
In this project, we will use the Iris dataset and the boston dataset as samples to learn about the various aspects that go into choosing a machine learning model. Let’s review the last project by loading the Iris and boston datasets, then printing the first 5 rows of each dataset.
- Output of running code to print the first 5 rows of both datasets.
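As a rough starting point, the loading step can be sketched with only the Python standard library. The helper name first_rows is made up for this illustration, and the existence check is there only because the dataset paths are expected to exist on Anvil and nowhere else; a dataframe library would work just as well.

```python
import csv
from itertools import islice
from pathlib import Path


def first_rows(path, n=5):
    """Return the header and the first n data rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)            # first line holds the column names
        rows = list(islice(reader, n))   # next n lines are data rows
    return header, rows


for path in (
    "/anvil/projects/tdm/data/iris/Iris.csv",
    "/anvil/projects/tdm/data/boston_housing/boston.csv",
):
    if Path(path).exists():  # skip quietly when not running on Anvil
        header, rows = first_rows(path)
        print(header)
        for row in rows:
            print(row)
```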
Question 2 (2 points)
One of the most distinguishing features of machine learning is the difference between classification and regression.
Classification is the process of predicting a discrete class label. For example, predicting whether an email is spam or not spam, whether a patient has a disease or not, or if an animal is a dog or a cat.
Regression is the process of predicting a continuous quantity. For example, predicting the price of a house, the temperature tomorrow, or the weight of a person.
Some columns may be misleading. Just because a column contains numbers does not mean it is a regression problem. One-hot encoding is a technique used to convert categorical variables into numerical variables (we will cover this deeper in future projects). Therefore, it is important to try and understand what a column represents, as just seeing a number does not necessarily mean it corresponds to a continuous quantity.
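To see why numbers alone can mislead, here is a minimal sketch of one-hot encoding in plain Python. The function name one_hot and the sample labels are assumptions made for illustration; they are not taken from the project datasets.

```python
def one_hot(labels):
    """Encode each label as a list of 0s and 1s, one slot per category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == category else 0 for category in categories]
               for label in labels]
    return categories, encoded


# Hypothetical class labels; the values are illustrative only.
species = ["setosa", "versicolor", "setosa", "virginica"]
categories, encoded = one_hot(species)

print(categories)
for label, row in zip(species, encoded):
    print(label, row)
```

Every encoded column holds only 0s and 1s. The columns are numeric, yet they still describe a discrete class rather than a continuous quantity.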
Let’s look at the Species column of the Iris dataset, and the MEDV column of the boston dataset. Based on these columns, classify the type of machine learning problem that we would be solving with each dataset.
Here’s a trickier example: If we have an image of some handwritten text, and we want to predict what the text says, would we be solving a classification or regression problem? Why?
- Would we likely be solving a classification or regression problem with the Species column of the Iris dataset? Why?
- Would we likely be solving a classification or regression problem with the MEDV column of the boston dataset? Why?
- Would we likely be solving a classification or regression problem with the handwritten text example? Why?
Question 3 (2 points)
Another important distinction in machine learning is the difference between supervised and unsupervised learning.
Supervised learning is the process of training a model on a labeled dataset. The model learns to map some input data to an output label based on examples in the training data. The Iris dataset is a great example of a supervised learning problem. Our dataset has columns such as SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm that contain information about the flower. Additionally, it has a column labeled Species that contains the label we want to predict. From these columns, the model can associate the features of the flower with the labeled species.
We can think of supervised learning as already knowing the answer to a problem, and working backwards to understand how we got there. For example, if we have a banana, an apple, and a grape in front of us, we can look at each fruit and its properties (shape, size, color, etc.) to learn how to distinguish between them. We can then use this information to predict a fruit from just its properties in the future.
For example, given this table of data:
Color | Size | Label |
---|---|---|
Yellow | Small | A |
Red | Medium | B |
Red | Large | B |
Yellow | Medium | A |
Yellow | Large | B |
Red | Small | B |
You should be able to describe a relationship between the color and size, and the resulting label. If you were told an object is yellow and extra large, what would you predict the label to be?
The projects in 30100 and 40100 will focus on supervised learning. From our dataset, there will be a single column we want to predict, and the rest will be used to train the model. The column we want to predict is called the label/target, while the remaining columns are called features.
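The split between features and label can be sketched in plain Python using the color and size table from this question. The variable names rows, features, and labels are chosen for illustration, and treating the last column as the label is an assumption that matches this table only.

```python
# Rows copied from the color/size table above: (Color, Size, Label).
rows = [
    ("Yellow", "Small", "A"),
    ("Red", "Medium", "B"),
    ("Red", "Large", "B"),
    ("Yellow", "Medium", "A"),
    ("Yellow", "Large", "B"),
    ("Red", "Small", "B"),
]

# Everything except the last column is a feature; the last column is the label.
features = [row[:-1] for row in rows]
labels = [row[-1] for row in rows]

for x, y in zip(features, labels):
    print(x, "->", y)
```

A supervised model would be trained on the pairs printed here: it sees the features together with the known label and tries to learn the mapping between them.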
Unsupervised learning is the process of training a model on an unlabeled dataset. As opposed to the model trying to predict an output variable, the model instead learns patterns in the data without any guidance. This is often used in clustering problems, e.g., a store wants to group items based on how often they are purchased together. Examples of this can be seen commonly in recommendation systems (have you ever noticed how Amazon always seems to know what you want to buy?).
If we had a dataset of fruits that users commonly purchase together, we could use unsupervised learning to create groups of fruits to recommend to users. We don’t need to know the answer for what to recommend to the user beforehand; we are simply looking for patterns in the data.
For example, given the following dataset of shopping carts:
Item 1 | Item 2 | Item 3 |
---|---|---|
Apple | Banana | Orange |
Apple | Banana | Orange |
Apple | Grape | Kiwi |
Banana | Orange | Apple |
Orange | Banana | Apple |
Cantaloupe | Watermelon | Honeydew |
Cantaloupe | Apple | Banana |
We could use unsupervised learning to recommend fruits to users right before they check out. If a user had an orange and banana in their cart, what fruit would we recommend to them?
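One simple way to look for such patterns without any labels is to count which items appear together. The sketch below does this for the carts in the table above. The function name recommend and the counting rule are assumptions made for illustration; this is not a full clustering or recommendation algorithm.

```python
from collections import Counter

# Shopping carts from the table above, stored as sets because order is irrelevant.
carts = [
    {"Apple", "Banana", "Orange"},
    {"Apple", "Banana", "Orange"},
    {"Apple", "Grape", "Kiwi"},
    {"Banana", "Orange", "Apple"},
    {"Orange", "Banana", "Apple"},
    {"Cantaloupe", "Watermelon", "Honeydew"},
    {"Cantaloupe", "Apple", "Banana"},
]


def recommend(cart, past_carts):
    """Suggest the item most often bought together with everything in cart."""
    counts = Counter()
    for other in past_carts:
        if cart <= other:                 # past cart contains every current item
            counts.update(other - cart)   # count the items the user lacks
    return counts.most_common(1)[0][0] if counts else None


# Example query with a single item in the cart.
print(recommend({"Apple"}, carts))
```

No column in the data says what the right recommendation is; the suggestion comes only from how often items co-occur.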
- Predicted label for an object that is yellow and extra large in the table above.
- What fruit would we recommend to a user who has an orange and banana in their cart?
- Should we use supervised or unsupervised learning if we want to predict the Species of some data using the Iris dataset? Why?
Question 4 (2 points)
Another important tradeoff in machine learning is the flexibility of the model versus the interpretability of the model.
A model’s flexibility is defined by its ability to capture complex relationships within the dataset. This can be anything from
Imagine a simple function f(x) = 2x. This function is very easy to interpret: it simply doubles x. However, it is not very flexible, as doubling the input is all it can do. A piecewise function like f(x) = { x < 5: 2x^2 + 3x + 4, x >= 5: 4x^2 - 7 } is considered more flexible, because it can model more complex relationships. However, it becomes much more difficult to understand the relationship between the input and output.
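The two functions above can be written out directly, which makes the contrast easier to see. The names simple and piecewise are chosen for this sketch only.

```python
def simple(x):
    """f(x) = 2x: easy to interpret, but it can only double its input."""
    return 2 * x


def piecewise(x):
    """The piecewise example: a different quadratic on each side of x = 5."""
    if x < 5:
        return 2 * x**2 + 3 * x + 4
    return 4 * x**2 - 7


for x in (1, 4, 5, 6):
    print(x, simple(x), piecewise(x))
```

Reading the output of simple, the rule is obvious from a few values. For piecewise, the same few values say much less about how the input drives the output.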
We can also see this complexity increase as we increase the number of variables. f(x) will typically be more interpretable than f(x,y), which will typically be more interpretable than f(x,y,z). When we get to a large number of variables, e.g. f(a,b,c,…,x,y,z), it can become difficult to understand the impact of each variable on the output. However, a function that captures all of these variables can be very flexible.
Machine learning models can be imagined in the same way. Many factors, including the type of model and the number of features, can impact the interpretability of the model. A function that can accurately capture the relationship between a large number of features and the target variable can be extremely flexible but not understandable to humans. A model that performs some simple function between the input and output may be very interpretable, but as the complexity of that function increases, its interpretability decreases.
An important concept in this regard is the curse of dimensionality. The general idea is that as the number of features (dimensions) increases, the amount of data needed to get a good model increases exponentially. Therefore, it is impractical to have an extreme number of features in our model. Imagine we are given a 2D function y=f(x). Given some points that we plot, we could probably find an approximation of f(x) pretty quickly. However, imagine we are given y=f(x1,x2,x3,x4,x5). We would need many more points to find an approximation of f(x1,x2,x3,x4,x5) and to understand the relationship between y and each of the variables. Just because we can have a lot of features in our model does not mean we should.
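A rough way to picture this growth is to ask how many points it takes to cover every feature at a fixed resolution. The sketch below assumes, purely for illustration, that we want 10 evenly spaced sample values along each feature; the function name grid_points is made up for this example.

```python
def grid_points(points_per_axis, n_features):
    """Number of points in a full grid with the given resolution per feature."""
    return points_per_axis ** n_features


for n_features in (1, 2, 5, 10):
    print(n_features, "features ->", grid_points(10, n_features), "grid points")
```

Going from 1 feature to 5 features multiplies the required points from 10 to 100000, even though the resolution along each feature has not changed.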
Please print the number of columns in the Iris dataset and the boston dataset. Based purely on the number of columns, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?
- How many columns are in the Iris dataset?
- How many columns are in the boston dataset?
- Based purely on the number of features, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?
Question 5 (2 points)
Parameterization is the idea of approximating a function or model using parameters. If we have some function f, and we have examples of f(x) for many different x, we can find an approximate function to represent f. To make this approximation, we will need to choose some function to represent f, along with the parameters of that function. For complex functions, this can be difficult, as we may not understand the relationship between x and f(x), or how many parameters are needed to represent this relationship.
A non-parametrized model does not necessarily mean that the model does not have parameters. However, it means that we don’t know how many of these parameters exist or how they are used before training. The model itself will work to figure out what parameters it needs while training on the dataset. This can be visualized with splines, which are a type of curve that can be used to approximate a function. There are also non-parametrized models such as K-Nearest Neighbors Regression, which do not have a fixed number of parameters, and instead learn the function from the data.
If we have 5 points (x, y) and want to find a function to fit these points, through parameterization we would have a single function with multiple parameters that need to be adjusted to give us the best fit. However, with splines (a form of non-parametrization), we could create a piecewise function, where each piece is a linear function between two points. This function has no parameters, and is created by the model solely based on the data. You can read more about splines here.
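The piecewise idea can be sketched as straight-line interpolation between neighboring points. The five points below are hypothetical, the name piecewise_linear is chosen for this example, and the sketch assumes every x value is distinct.

```python
from bisect import bisect_right


def piecewise_linear(points):
    """Build a function that connects neighboring (x, y) points with lines."""
    pts = sorted(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]

    def f(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the range of the data")
        # Find the segment whose left endpoint is at or before x.
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        x0, x1 = xs[i], xs[i + 1]
        y0, y1 = ys[i], ys[i + 1]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return f


points = [(1, 2), (2, 4), (3, 3), (4, 7), (5, 5)]  # hypothetical data
f = piecewise_linear(points)
print(f(2.5), f(4.0))
```

Nothing about the shape of f was chosen in advance. Adding or moving a point changes the function directly, which is the sense in which it is built solely from the data.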
A commonly used non-parametrized model is k-nearest neighbors, which classifies points by comparing them to existing points in the dataset. In this way, the model does not have any parameters, but instead only learns from the data.
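A minimal sketch of k-nearest neighbors classification is shown below. The training points, the function name knn_predict, and the choice of k = 3 are all assumptions made for illustration; note that k is picked by the user rather than learned from the data.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k closest training points.

    train is a list of (features, label) pairs; query is a feature tuple.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points in two dimensions.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.2)))
```

There is no separate fitting step here: the stored training points are the model, and each prediction is made by comparing the query against them.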
Linear regression is a parametrized model, where a linear relationship between inputs and output(s) is assumed. The data is then used to identify the values of the parameters to best fit the data.
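For a single input, fitting a line means finding two parameters, a slope and an intercept. The sketch below uses the standard least-squares formulas on hypothetical data; the function name fit_line is chosen for this example.

```python
def fit_line(xs, ys):
    """Return the least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, slightly noisy data that is roughly linear.
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

However many data points we supply, the model is still described by these two numbers, because the linear form was assumed before looking at the data.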
If we already have a good understanding of the data (e.g., we know it to be some linear function or second order polynomial), it is likely best to choose a parametrized model. However, if we don’t have an understanding of the data, a non-parametrized model that learns the function from the data may be a better fit.
To better understand the difference, please run the following code:
import matplotlib.pyplot as plt
a = [1, 3, 5, 7, 9, 11, 13]
b = [9, 6, 4, 7, 8, 15, 9]
x = [1, 2, 3, 4, 5, 6, 7]
plt.scatter(x, a, label='Function A')
plt.scatter(x, b, label='Function B')
plt.legend()
plt.xlabel('Feature X')
plt.ylabel('Label y')
plt.show()
Based on the plots shown, decide if each function would be better approximated by a parametrized or non-parametrized model.
- Can you easily describe the relationship between Feature X and Label y for Function A? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function?
- Can you easily describe the relationship between Feature X and Label y for Function B? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function?
Question 6 (2 points)
As a practical example, we will take a look at the forest species dataset to determine the different aspects that go into choosing our machine learning model.
Load the forest species dataset and print the shape of the dataset and its first 10 rows.
Based on what you see from those outputs, please answer the following questions:
- Could you solve a regression problem with this dataset? What about a classification problem? What column(s) would you use as the target variable in each case?
- Could you use unsupervised learning on this dataset? Supervised learning? Please explain your answer for each.
- Do you think a model trained on all columns of this dataset would be very interpretable?
- Do you think a parametrized model would work well given the number of features?
Submitting your Work
- firstname_lastname_project2.ipynb
You must double check your You will not receive full credit if your