Chapter 2. Regression -- 03. Evaluating Regression Models

So let’s talk about how to evaluate a regression model. We have to figure out a way to measure
how close the truth, y, is to what we predicted, which is f(x). Here’s what I proposed
earlier, y - f(x), and we didn’t like that because we could equally have proposed the
difference the other way around. So we want to use something that penalizes mistakes in
either direction. We could use the distance between the truth and the prediction, and we could
assess the mean absolute error, which is the average of these distances. That is one way to do
it. But you could also square all of them and get the sum of squared errors (the least
squares error), or the mean squared error. You can think of squaring as capturing errors
in both directions, like the absolute value that we had earlier. It’s the same deal: penalize
how far y is from f(x) in either direction. And then the root mean squared error is just
the square root of the mean squared error. Now, switching topics, let’s talk about residuals.
The residual, or error, is just the difference between the truth and the prediction. And
obviously we want all of these to be close to 0. We also don’t want to see any
sort of structure or pattern in the residuals, because that would mean there is more modeling
that we have to do. Let’s discuss this a bit more.
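As a rough sketch of the quantities just described, here is how the residuals and the error measures could be computed in Python. The numbers in y and f_x are made up for illustration; they are not from the lecture.

```python
import math

# Hypothetical toy values, for illustration only (not from the lecture).
y = [3.0, 5.0, 7.5, 10.0]      # the truth
f_x = [2.5, 5.5, 7.0, 12.0]    # the model's predictions f(x)

# Signed errors: positive and negative ones would cancel if simply added up.
residuals = [truth - pred for truth, pred in zip(y, f_x)]

n = len(residuals)
mae = sum(abs(r) for r in residuals) / n   # mean absolute error
sse = sum(r ** 2 for r in residuals)       # sum of squared errors
mse = sse / n                              # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error

print(residuals, mae, sse, mse, rmse)
```

Both the absolute value and the square throw away the sign, which is why either one penalizes a miss in either direction. The square weights large misses more heavily than the absolute value does.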
Now, we want the residuals to be as close to 0 as possible, and we don’t want any apparent
pattern in them. So what I am going to do here is plot a histogram of all of
the residuals, y_i - f(x_i). Where this histogram is higher is where most of the residuals
are, and ideally they are mostly close to 0.
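A minimal sketch of such a histogram, printed as text so it needs no plotting library. The residual values and the bin width of 0.5 are assumptions chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical residuals y_i - f(x_i), for illustration only.
residuals = [-0.2, 0.1, 0.4, -0.6, 0.0, 1.3, -0.1, 0.3, -1.4, 0.2, -0.3, 0.6]

bin_width = 0.5
# Bin k covers the interval [k * bin_width, (k + 1) * bin_width).
counts = Counter(math.floor(r / bin_width) for r in residuals)

# One row per non-empty bin; the longer the bar, the more residuals fall there.
for k in sorted(counts):
    low = k * bin_width
    print(f"[{low:+.1f}, {low + bin_width:+.1f})  {'#' * counts[k]}")

tallest_bin = max(counts, key=counts.get)
mean_residual = sum(residuals) / len(residuals)
print("tallest bin starts at", tallest_bin * bin_width)
print("mean residual", round(mean_residual, 3))
```

In this toy case the tallest bars sit in the bins on either side of 0, which is the shape we are hoping for.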
So this would be good. Most of the residuals are around 0 and there is no pattern in
these residuals. If you see this you can smile, because you know you probably
did something right. This, on the other hand, is bad. This probably means that we missed
something. So what could explain a plot like this?
Maybe we are modeling the data with one line, and actually the data falls along
3 lines, which is why there are 3 bumps. Maybe we just missed the other 2
lines. I don’t know, and I can’t tell exactly what’s going on from looking at this. In
order to figure out what’s causing it, we have to figure out what factor is common
to everyone in this bump, then figure out what’s common to everyone
in that bump, and then include these extra factors in the regression so that we
can model whatever signal is there and try to get rid of all those bumps.
This is also bad. Perhaps here we just have the wrong model. There are a lot of points that we are
just not describing well; there are a ton of points that are nowhere near f(x). Not good.
Now this is the list of features from the automobile pricing data set. We have all these factors
about each car, and we are trying to predict the price. We are going to do a regression
using a few of these features. If you are curious,
we actually use fuel type, engine horsepower, type of aspiration, which
is turbo or not, and curb weight. We use those things to predict the price of the car.
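Going back to the three-bump histogram for a moment, that situation can be reproduced with synthetic data. The sketch below is only an illustration under stated assumptions: the groups "A", "B", "C", their offsets, the slope of 2 and the noise level are all invented, and fit_line is a small helper written here, not a library function. Fitting one line per group is just the simplest way to bring the missing factor into the model.

```python
import random
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

random.seed(0)
# Synthetic data (an assumption): three parallel lines, one per hidden group.
offsets = {"A": 0.0, "B": 10.0, "C": 20.0}
rows = [(g, x, off + 2.0 * x + random.gauss(0, 0.5))
        for g, off in offsets.items()
        for x in (random.uniform(0, 10) for _ in range(200))]

# Model 1: a single line for everything, ignoring the group.
a, b = fit_line([x for _, x, _ in rows], [y for _, _, y in rows])
pooled = {g: [] for g in offsets}
for g, x, y in rows:
    pooled[g].append(y - (a + b * x))
# Each group's residuals cluster around a different value: three bumps.
pooled_means = {g: mean(r) for g, r in pooled.items()}

# Model 2: include the missing factor by fitting one line per group.
per_group_residuals = []
for g in offsets:
    xs = [x for gg, x, _ in rows if gg == g]
    ys = [y for gg, _, y in rows if gg == g]
    a_g, b_g = fit_line(xs, ys)
    per_group_residuals += [y - (a_g + b_g * x) for x, y in zip(xs, ys)]

pooled_all = [r for rs in pooled.values() for r in rs]
print("mean residual per group, one line:", pooled_means)
print("residual spread, one line:       ", stdev(pooled_all))
print("residual spread, line per group: ", stdev(per_group_residuals))
```

With the single line, each group's residuals pile up around its own value, which is what shows up as separate bumps. Once the group is in the model, the residuals collapse to a single bump around 0.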
And this is a plot produced by R, which shows something interesting.
Along this axis is the estimated price, so the more expensive cars are over
here. Now I just want you to pay attention to the points here. Each point is a car.
You can see that for the cheaper cars our predictions are very good: the residuals are close to
0. But when we get to the more expensive cars the predictions
get worse; our predictions are farther from the truth. So why might that be? We do have
a larger number of cheaper cars, so the regression model is going to try to fit the cheaper cars
better just because there are more of them. But I was also talking to Steve about this,
and he thinks it is because the more expensive cars are more diverse, so you just cannot
predict very well for the pricier cars.
What the heck is that? Right, that guy: we didn’t
get the price on that one right at all. Now this car happens to be an expensive 12-cylinder
Jaguar with some unusual properties. It’s actually a two-seater with a huge hood and an
enormous engine. Actually, this plot has the cars colored by the number of cylinders, so
you can see that we predict better for smaller numbers of cylinders than we do for larger
ones, which I think is pretty interesting. So we might want to think about putting the
number of cylinders into the model.
And then here is a residual plot, which is again a
histogram of residuals, and luckily for us, phew, it is nicely centered at 0. And here,
my friends, is the Jaguar.
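The pattern in the car plot, small residuals for cheap cars and larger ones for expensive cars, can also be checked numerically rather than by eye. The sketch below does not use the automobile data. It uses synthetic data whose noise grows with the predictor, an assumption made purely to imitate that pattern, and fit_line is again a small helper written for this example.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

random.seed(1)
# Synthetic stand-in for the car data (an assumption): the noise in the
# "price" grows with the predictor, so expensive items are harder to predict.
sizes = [random.uniform(1, 10) for _ in range(500)]
prices = [1000 * s + random.gauss(0, 100 * s) for s in sizes]

a, b = fit_line(sizes, prices)
predicted = [a + b * s for s in sizes]
residuals = [p - f for p, f in zip(prices, predicted)]

# Sort by predicted price and compare the typical miss in each half.
by_prediction = sorted(zip(predicted, residuals))
half = len(by_prediction) // 2
cheap_error = mean(abs(r) for _, r in by_prediction[:half])
expensive_error = mean(abs(r) for _, r in by_prediction[half:])

# The single worst miss, the analogue of the Jaguar in the plot.
worst_prediction, worst_residual = max(by_prediction, key=lambda pair: abs(pair[1]))

print("mean |residual|, cheaper half:  ", round(cheap_error))
print("mean |residual|, pricier half:  ", round(expensive_error))
print("worst miss:", round(worst_residual), "at predicted", round(worst_prediction))
```

Note that the overall histogram of these residuals is still centered at 0, because a least squares line with an intercept always has residuals that average to 0. A histogram centered at 0 therefore does not rule out the kind of growing spread seen in the scatter plot.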
