Mean Squared Error, that is.

Questions we'll answer in this microlesson:

1. What is Mean Squared Error?
2. How do you calculate Mean Squared Error?
3. When and why do we use Mean Squared Error?

And, we'll have an opportunity to apply it to a real-life example: Moore's Law.

First, why do we care about error?

Let's say we just performed an experiment and collected some data (Figure 1).

We want to figure out the relationship between our points. So, we've made two predictions: the red line and the green line. But, how do we know which one is better?

We can look at the error. In statistics, error is a measure of how wrong a prediction is. The smaller the error, the better the prediction. So, we should pick the prediction with the smaller error.

Introducing Mean Squared Error.

Mean Squared Error (MSE) is one way we can calculate the error of a predictive model. It's not too tricky; here's how it works:

1. For each point, take the difference between the collected data value and our predicted value; square it.
2. Add that squared difference to a running total, and repeat for each point.
3. Divide the total by the number of points to get the average!
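The steps above can be sketched in a few lines of Python (the example values are made up):

```python
def mse(actual, predicted):
    """Mean Squared Error: the average of the squared differences."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        total += (y - y_hat) ** 2   # steps 1-2: difference, squared, accumulated
    return total / len(actual)      # step 3: divide by the number of points

# Hypothetical data and predictions:
print(mse([2, 4, 6], [1, 4, 8]))   # (1 + 0 + 4) / 3 = 5/3 ≈ 1.667
```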

That wasn't too bad, was it? Sometimes (like in textbooks), you'll see it in more complicated math notation:

$MSE = \frac{1}{m} \sum\limits_{i=1}^m (y_i-\hat y_i)^2$

But don't be scared! Now, you know what it means!

Some things to note:

• MSE is always nonnegative.

So, whether your prediction is greater than or less than your data point will have no bearing on your error!

$(5-3)^2=(3-5)^2=4$

The following two graphs will have the same MSE because every point is vertically equidistant from the prediction line.

• Because we are squaring the differences, bigger differences are heavily (nonlinearly) punished!

Example:

$(5-3)^2=4$ $(6-3)^2=9$

In this example, even though 5 and 6 are just 1 unit apart, their squared errors, when compared against 3, are 5 units apart.

Therefore, outliers in our data can be pretty hurtful to our MSE.
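A quick sketch (with made-up data) of how a single outlier inflates the MSE:

```python
def mse(actual, predicted):
    """Average of squared differences."""
    return sum((y, p) and (y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

predicted = [3, 3, 3, 3]
clean   = [2, 4, 2, 4]    # every point is 1 unit off
outlier = [2, 4, 2, 13]   # one point is 10 units off

print(mse(clean, predicted))    # (1 + 1 + 1 + 1) / 4 = 1.0
print(mse(outlier, predicted))  # (1 + 1 + 1 + 100) / 4 = 25.75
```

A single point 10 units away contributes 100 to the sum, dwarfing the other three points combined.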

Let's find the MSE of each line in Figure 1 above, starting with the red line.

The equation of the line is:

$y = \frac{3}{2}x$

Notice the two sets of data points in Figure 1a. The blue points are our original data values, and the black points are our predicted values based on the line equation.

With MSE, we will be dealing with the vertical distances between the two sets of data points.

Now, the math.

Consider the first point, $(1,\:2)$. Its actual $y$-value is $2$, while the line predicts $\frac{3}{2}(1) = 1.5$.

We can now subtract the two to find the distance,
or "error," for the first point:

$2 - 1.5 = 0.5$

Repeating this for all six points, we get the following errors:

$0.5 \quad 0 \quad 1.5 \quad -2 \quad 0.5 \quad 0$

Now, we square these errors and sum them (almost there!).

$0.5^2 + 0^2 + 1.5^2 + (-2)^2 + 0.5^2 + 0^2$

$= 6.75$

Finally, since we want to take the mean of the squared errors,
we divide by the number of data points -- in our case, 6.

$6.75 \div 6$

$= 1.125$
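We can double-check this computation in Python. The first point $(1, 2)$ is given above; the other five points are reconstructed from the listed errors, assuming the data's $x$-values are $1$ through $6$ (an assumption, since the figure isn't reproduced here):

```python
xs = [1, 2, 3, 4, 5, 6]   # assumed x-values
ys = [2, 3, 6, 4, 8, 9]   # reconstructed from the errors worked out above

predicted = [1.5 * x for x in xs]                 # y = (3/2)x
errors = [y - p for y, p in zip(ys, predicted)]
print(errors)                                     # [0.5, 0.0, 1.5, -2.0, 0.5, 0.0]

mse = sum(e ** 2 for e in errors) / len(errors)
print(mse)                                        # 6.75 / 6 = 1.125
```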

Let's try applying the previous steps to the second line of fit,

$y = x + 2$

Figure 1b


Let's take one point at a time, from left to right in Figure 1b.

For example, our first point has a y-value of 2 but an expected y-value of 3. So our squared error for Point 1 is:

$(2-3)^2=(-1)^2=1$

Find the squared error for all six points.
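If you want to check your answers afterwards, here is a sketch using the same assumed data points as in the red-line calculation:

```python
xs = [1, 2, 3, 4, 5, 6]   # assumed x-values, as before
ys = [2, 3, 6, 4, 8, 9]   # reconstructed from the worked errors above

predicted = [x + 2 for x in xs]                   # y = x + 2
squared_errors = [(y - p) ** 2 for y, p in zip(ys, predicted)]
print(squared_errors)                             # [1, 1, 1, 4, 1, 1]
print(sum(squared_errors) / len(squared_errors))  # 9 / 6 = 1.5
```

Under these assumptions, the green line's MSE (1.5) is larger than the red line's (1.125), so the red line is the better fit.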

Now, a real-world example.

Your goal is to fit a model to Moore's Law.

Before we start fitting, we need to notice that this graph is not linear; it is exponential. This can introduce bias into our regression line, because the data will not be normally distributed about any straight prediction line drawn through it.

$y = ae^{bx}$

We can easily fix this by converting the graph to a logarithmic scale (taking the natural log of both sides).

$\ln{y} = \ln{a} + bx$

Don't worry too much if this is confusing; the idea is that we are converting this exponential graph to a linear one, so that our MSE is as unbiased as possible.
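Here is a small sketch of why taking logs linearizes an exponential. Using synthetic data generated from $y = ae^{bx}$ (with made-up values $a = 3$, $b = 0.5$), the log-transformed values rise by the same amount for every unit step in $x$, which is exactly what "linear" means:

```python
import math

# Synthetic exponential data: y = a * e^(b x), with made-up a and b
a, b = 3.0, 0.5
xs = [0, 1, 2, 3, 4, 5]
ys = [a * math.exp(b * x) for x in xs]

# After the transform, ln(y) = ln(a) + b x is a straight line in x:
log_ys = [math.log(y) for y in ys]
slopes = [log_ys[i + 1] - log_ys[i] for i in range(len(xs) - 1)]
print(slopes)  # each consecutive difference ≈ b = 0.5
```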

Moore's Law (1971-2018)

This law was the observation that the number of transistors that could fit on a microchip would double every two years (first proclaimed in 1965).
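"Doubling every two years" pins down the slope of the log-linear model: if $y = a \cdot 2^{x/2}$ (with $x$ in years), then rewriting with base $e$ gives $b = \ln 2 / 2 \approx 0.347$. A quick sanity check (the starting transistor count below is just an illustrative value):

```python
import math

# Doubling every 2 years: y = a * 2^(x/2) = a * e^(b x), so b = ln(2) / 2
b = math.log(2) / 2
print(b)  # ≈ 0.3466

# Sanity check: stepping x forward by 2 years should multiply y by 2.
a = 2300  # illustrative starting transistor count
y0 = a * math.exp(b * 0)
y2 = a * math.exp(b * 2)
print(y2 / y0)  # ≈ 2.0
```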

Experiment with different models and try to find a good fit!

The smaller the MSE, the better! Keep in mind that the scale is logarithmic, so the line of fit can be somewhat deceiving. Click "Convert Scale" to see your fit on the original curve!


To recap...

1. MSE punishes positive and negative error equally, and nonlinearly.

2. It is always a nonnegative value, and the smaller the MSE, the better our regression line.

3. We should restrict the use of MSE to linearly correlated data, to minimize bias.

Keep in mind that there are various metrics to measure the error of a model, and MSE is just one of them.

How can I tell if my MSE is "good" or "bad"?

Unfortunately, there is no definite answer, as it depends heavily on the context of the problem. In certain situations, like training a model to drive a car, a high MSE would be disastrous. In other cases, it might be more forgivable.

It isn't always good to have a small MSE. If it is too close to 0, we might see overfitting. This means our model is too "tailored" for our specific data points and might be way off if we tried a new set of similar data points. Generally, the larger our range of data and the more diverse our data is, the more "OK" it is to have a large MSE.

The Moore's Law example above demonstrates that a model that is pretty close to the data points might still produce a big MSE, simply because we are working with fairly large numbers. There is no universal rule for what counts as a "good" MSE, but we can probably say that an MSE of 100 would be much more reasonable here than if we were looking at a data set with a range of 10.