The relationship between detach(), data, with no_grad(), and requires_grad: a summary


- I. The requires_grad attribute
  - 1. Exploring what requires_grad does
  - 2. Conclusions
- II. The detach() method
- III. with torch.no_grad()
- IV. The data attribute
- V. Summary

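While learning PyTorch recently, I ran into a few new concepts that made my head swell (it was big to begin with, and now it's bigger ==!):

the requires_grad attribute
detach()
the with no_grad() context manager
the data attribute

I could never work out what these are actually for or how they relate to each other. As someone who always has to ask "why", I need to write it all down, or it will keep bugging me.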

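I. The requires_grad attribute

requires_grad:

The official docs say: If autograd should record operations on the returned tensor. Default: False.

In other words, it controls whether autograd tracks the operations performed on the tensor, and it defaults to False.
What does that mean in practice? Let's test it with code!

1. Exploring what requires_grad does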

# Test: do nothing special and see what happens when we ask for the gradient
import torch

x = torch.tensor([1.0, 2.0])
y1 = x ** 2
y2 = y1 * 2
y3 = y1 + y2

print(y1, y1.requires_grad)
print(y2, y2.requires_grad)
print(y3, y3.requires_grad)

# Why does backward() need torch.ones(y3.shape) as an argument?
# That is a separate question; feel free to discuss it in the comments.
y3.backward(torch.ones(y3.shape))  # y1.backward() y2.backward()
print(x.grad)

# Result:
tensor([1., 4.]) False
tensor([2., 8.]) False
tensor([ 3., 12.]) False
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After setting requires_grad=True on x, the result is as follows:

# Set requires_grad to True
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y1 = x ** 2
y2 = y1 * 2
y3 = y1 + y2

print(y1, y1.requires_grad)
print(y2, y2.requires_grad)
print(y3, y3.requires_grad)

y3.backward(torch.ones(y3.shape))  # y1.backward() y2.backward()
print(x.grad)

""" Result:
tensor([1., 4.], grad_fn=<PowBackward0>) True
tensor([2., 8.], grad_fn=<MulBackward0>) True
tensor([ 3., 12.], grad_fn=<AddBackward0>) True
tensor([ 6., 12.])
"""

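Now the printed y1, y2, and y3 each carry an extra attribute, grad_fn.
For example, y1's grad_fn=<PowBackward0> means that y1 was produced by a pow operation, i.e. raising to a power.
Looking back at y1 = x ** 2, that is exactly what happened.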
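
The two runs above can be condensed into one short sketch. It is only an illustration: the names a, b, and raised are mine, and sum() is used to get a scalar so that backward() needs no argument.

```python
import torch

# Leaf tensor created without requires_grad: nothing derived from it is tracked.
a = torch.tensor([1.0, 2.0])
b = a ** 2
print(b.requires_grad, b.grad_fn)  # False None

# Calling backward() on an untracked result raises RuntimeError.
try:
    b.sum().backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# With requires_grad=True on the leaf, results derived from it are tracked.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2
print(y.requires_grad, y.grad_fn is not None)  # True True

y.sum().backward()
print(x.grad)  # tensor([2., 4.])
```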

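2. Conclusions:

"""
1. When requires_grad is set to False (or left at its default), computing the gradient fails with:
   RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
   No computation history was tracked, so there is nothing to compute a gradient from.
2. So decide whether to track the computation history when you first define the tensor x.
"""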

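II. The detach() method

detach():

The official docs say: Returns a new Tensor, detached from the current graph. The result will never require gradient.

That is, it returns a new tensor that is detached from the current computation graph, and operations on that tensor are not recorded for gradient computation.
Let's see what that means in code!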

# Set requires_grad to True
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y1 = x ** 2
y2 = y1.detach() * 2     # note: y1 is detached when computing y2
y3 = y1 + y2

print(y1, y1.requires_grad)
print(y2, y2.requires_grad)
print(y3, y3.requires_grad)

y3.backward(torch.ones(y3.shape))  # y1.backward() y2.backward()
print(x.grad)

Result:

tensor([1., 4.], grad_fn=<PowBackward0>) True
tensor([2., 8.]) False
tensor([ 3., 12.], grad_fn=<AddBackward0>) True
tensor([2., 4.])

"""
The tensor y2 is printed without a grad_fn (it is actually None), so it is no longer tracked.
y1 and y3 still show their most recent operation, but the final gradient of x has changed.
"""

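Compare the gradients before and after using detach(): tensor([ 6., 12.]) and tensor([2., 4.]).
(1) tensor([ 6., 12.])

y3 = y2 + y1, with y2 = y1 * 2 and y1 = x ** 2,
so y3 = 3 * x ** 2, and the partial derivative of y3 with respect to xi is 6 * xi.
For x = [1, 2],
the gradient (the partial derivatives) is therefore [6, 12].

(2) tensor([ 2., 4.])

y3 = y2 + y1, but y2 is computed from y1.detach().
By definition, y2 is treated as a constant when the gradient is computed, although the value of y3 is still computed from the original formula.
So for gradient purposes only the y1 term depends on x: y3 = x ** 2 + constant,
and the partial derivative of y3 with respect to xi is 2 * xi.
For x = [1, 2],
the gradient (the partial derivatives) is therefore [2, 4].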
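
The two derivations can be checked numerically. This is a minimal sketch: the helper grad_of_y3 and its detach_y1 flag are hypothetical names introduced only for this comparison, and torch.ones_like(y3) is equivalent to the torch.ones(y3.shape) used earlier.

```python
import torch

def grad_of_y3(detach_y1: bool) -> torch.Tensor:
    """Return d(sum(y3))/dx for y3 = y1 + y2, optionally detaching y1 inside y2."""
    x = torch.tensor([1.0, 2.0], requires_grad=True)
    y1 = x ** 2
    y2 = (y1.detach() if detach_y1 else y1) * 2
    y3 = y1 + y2
    y3.backward(torch.ones_like(y3))
    return x.grad

x_values = torch.tensor([1.0, 2.0])

full = grad_of_y3(detach_y1=False)  # y3 = 3 * x ** 2, so the gradient is 6 * x
cut = grad_of_y3(detach_y1=True)    # y2 is a constant, so the gradient is 2 * x

print(full, torch.equal(full, 6 * x_values))  # tensor([ 6., 12.]) True
print(cut, torch.equal(cut, 2 * x_values))    # tensor([2., 4.]) True
```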

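To summarize detach():
when, at some step of a computation, we no longer need to track a particular tensor, we can use detach() to separate it from the tracked history. Gradients then no longer flow back through the computations that use the detached tensor.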

III. with torch.no_grad()

torch.no_grad():

The official docs also say: Disabling gradient calculation is useful for inference, when you are sure that you will not call Tensor.backward(). It will reduce memory consumption for computations that would otherwise have requires_grad=True.

For now, think of it as another way to switch off gradient tracking, one that can also reduce memory use. Let's look at the code and its results!

# Set requires_grad to True
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y1 = x ** 2
with torch.no_grad():  # no_grad() wraps the computation that should not be tracked
    y2 = y1 * 2
y3 = y1 + y2

print(y1, y1.requires_grad)
print(y2, y2.requires_grad)
print(y3, y3.requires_grad)

y3.backward(torch.ones(y3.shape))  # y1.backward() y2.backward()
print(x.grad)

Result:

tensor([1., 4.], grad_fn=<PowBackward0>) True
tensor([2., 8.]) False
tensor([ 3., 12.], grad_fn=<AddBackward0>) True
tensor([2., 4.])

"""The result matches the detach() version, so I won't analyze it again."""

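So in this example torch.no_grad() has the same effect on the gradient as detach().
What is the difference, then?
detach() separates a single tensor from the tracked history;
torch.no_grad() is a wrapper (a context manager) that leaves every computation inside its block untracked. As far as the gradient result here is concerned, there is essentially no difference.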
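
The "one tensor versus a whole block" distinction can be seen side by side. A small sketch, with d, z1, and z2 as illustrative names of my own:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# detach(): acts on one tensor. y1 itself is still tracked; only d is cut off.
y1 = x ** 2
d = y1.detach()
print(y1.requires_grad, d.requires_grad)  # True False

# no_grad(): every operation inside the block is left untracked.
with torch.no_grad():
    print(torch.is_grad_enabled())  # False
    z1 = x ** 2
    z2 = z1 * 2
print(z1.requires_grad, z2.requires_grad)  # False False

# Outside the block tracking is back on, and x itself was never changed.
print(torch.is_grad_enabled(), x.requires_grad)  # True True
```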

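IV. The data attribute

Having written this far, I suddenly remembered something important: detach() and the data attribute are both accessed on a tensor.
So how are their return values related?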

# Set requires_grad to True
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y1 = x ** 2

print(y1)
print(y1.data)
print(y1.detach())
print(id(y1), id(y1.data), id(y1.detach()))
print(id(y1.storage()), id(y1.data.storage()), id(y1.detach().storage()))

"""
tensor([1., 4.], grad_fn=<PowBackward0>)
tensor([1., 4.])
tensor([1., 4.])
3092644292392 3092631327272 3092631327272
3092650494536 3092650494536 3092650494536
"""

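From the output:

y1.detach(), y1.data, and y1 point to the same underlying memory.
y1.detach() and y1.data print the same id, which makes them look like the same object. That part is misleading: each temporary is freed right after id() is called, so Python can reuse the address. They are separate tensor objects that share memory with y1.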
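
A more reliable check than id() is to keep references alive and compare data_ptr(), which returns the address of a tensor's first element. A minimal sketch, with d1 and d2 as illustrative names:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y1 = x ** 2

# Hold on to both results so neither is freed while we compare them.
d1 = y1.data
d2 = y1.detach()

# Three distinct Python objects...
print(d1 is d2, d1 is y1, d2 is y1)  # False False False

# ...that all view the same underlying memory.
print(y1.data_ptr() == d1.data_ptr() == d2.data_ptr())  # True

# Neither result is tracked by autograd.
print(d1.requires_grad, d2.requires_grad)  # False False
```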

Now I'm confused again!

Fortunately, I finally found a good explanation, from Stack Overflow¹:

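A rough translation: data was an attribute of Variable. Before version 0.4, Variable seems to have been a class that wrapped a Tensor and the operations on it and provided extra attributes (I haven't looked into the details).
Since PyTorch 0.4, Variable and Tensor have been fully merged, so .data in its original sense went away together with Variable (Variable still exists for backward compatibility, but it is deprecated).

.data used to be the main way to get the underlying Tensor out of a Variable. After the merge, y = x.data gives a Tensor y that shares memory with x (matching what the test above showed), has requires_grad = False, and is unrelated to x's computation history.

However, .data can be unsafe in some cases. Changes made through x.data are not tracked by autograd, so if x is needed in the backward pass, the computed gradient will be wrong. A safer alternative is x.detach(), which also returns a Tensor that shares memory and has requires_grad = False, but if x is needed in the backward pass, autograd will report that the tensor has been modified in place.

The test code² is as follows: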

# tensor.data
>>> a = torch.tensor([1, 2, 3.], requires_grad=True)
>>> out = a.sigmoid()
>>> c = out.data
>>> c.zero_()
tensor([ 0., 0., 0.])
>>> out                   # memory is shared, so c.zero_() also changed out
tensor([ 0., 0., 0.])
>>> out.sum().backward()  # backward pass runs without complaint
>>> a.grad                # silently wrong: computed from the zeroed out
tensor([ 0., 0., 0.])

# tensor.detach()
>>> a = torch.tensor([1, 2, 3.], requires_grad=True)
>>> out = a.sigmoid()
>>> c = out.detach()
>>> c.zero_()
tensor([ 0., 0., 0.])
>>> out                   # out was changed by c.zero_() !!
tensor([ 0., 0., 0.])
>>> out.sum().backward()  # needs the original out, which c.zero_() overwrote, so it raises an error
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3]], which is output 0 of SigmoidBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
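
The "is at version 1; expected version 0" part of the message points at how autograd notices the change. The sketch below reads the tensor's _version attribute to show it. Note that _version is an internal, underscore-prefixed attribute, so treat this as an illustration of the mechanism and not as a stable API. The names v_data, v_detach, and raised are mine.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# In-place change through .data: out's version counter is not bumped,
# so autograd cannot tell that out was overwritten.
out = a.sigmoid()
out.data.zero_()
v_data = out._version
print(v_data)  # 0

# In-place change through detach(): the version counter is shared,
# so out's version goes up and backward() can detect the problem.
out = a.sigmoid()
out.detach().zero_()
v_detach = out._version
print(v_detach)  # 1

try:
    out.sum().backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True
```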

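V. Summary

To wrap up:

requires_grad: an attribute you can set when first creating a Tensor; it indicates whether operations on that Tensor are tracked. It can also be set later with the requires_grad_() method, but only on leaf nodes.
detach(): separates a single Tensor from the computation graph. It returns a Tensor that shares memory with the original, so changing one changes the other.
torch.no_grad(): leaves every computation inside the block untracked.
torch.no_grad() also uses less memory: from the start of the block it is known that no gradient will be needed, so intermediate results do not have to be saved.³
.data: formerly an attribute of Variable in PyTorch. It returns a Tensor that shares memory, so changing one changes the other, but it is rarely used nowadays.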
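
The "only on leaf nodes" point about requires_grad can be tried directly. A minimal sketch, with raised as an illustrative name; it only covers turning the flag off on a non-leaf tensor, which is the case I am sure raises an error:

```python
import torch

# A leaf tensor: tracking can be switched on after creation.
x = torch.tensor([1.0, 2.0])
print(x.is_leaf, x.requires_grad)  # True False
x.requires_grad_()
print(x.requires_grad)  # True

# A non-leaf tensor (the result of an operation on a tracked tensor).
y = x ** 2
print(y.is_leaf, y.requires_grad)  # False True

# Trying to switch tracking off on the non-leaf tensor raises RuntimeError;
# detach() is the way to get an untracked version of y instead.
try:
    y.requires_grad = False
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True
print(y.detach().requires_grad)  # False
```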

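I may have misunderstood some points; corrections and discussion are welcome.
One last question to leave you with:
why does y.backward() sometimes need to be passed a Tensor of the same shape as y?
Let's work it out together!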
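
One way to approach that question, offered as a starting point and not a full answer: backward() with no argument only works when the output is a scalar. For a non-scalar y, passing a tensor v of the same shape gives the gradient of (y * v).sum(), so torch.ones_like(y) is the same as calling y.sum().backward(). The names raised, g_vector, and g_scalar are mine.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# A non-scalar output cannot be differentiated without an argument.
y = 3 * x ** 2
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# Option 1: pass a tensor of ones with the same shape as y.
y = 3 * x ** 2
y.backward(torch.ones_like(y))
g_vector = x.grad.clone()

# Option 2: reduce y to a scalar first, then call backward() with no argument.
x.grad = None  # clear the accumulated gradient before the second run
y = 3 * x ** 2
y.sum().backward()
g_scalar = x.grad.clone()

print(g_vector, g_scalar)  # tensor([ 6., 12.]) tensor([ 6., 12.])
```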

1. Is .data still useful in pytorch?
2. The difference between tensor.detach() and tensor.data in PyTorch
3. Difference between "detach()" and "with torch.nograd()" in PyTorch?