PyTorch Pitfalls I Have Run Into (continuously updated)

Author: 与阳光共进早餐 | Published 2018-01-18 17:03 | Read 2423 times

Record bugs as carefully as you record life.

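1. Getting a Tensor out of an autograd.Variable

BUG:
RuntimeError: copy from Variable to torch.FloatTensor isn't implemented
This error is fairly simple, so I will not paste the full traceback.
Analysis:
The failing statement: new_output[:,:,i,:,:]=temp2D_output
Here new_output is a Tensor, while temp2D_output is a Variable.
So the question becomes how to get the Tensor out of an autograd.Variable.
Solution:
[Figure: structure of autograd.Variable]

For a refresher on how autograd.Variable is structured, see
"PyTorch入门学习（二）：Autogard之自动求梯度" (Getting Started with PyTorch (2): Automatic Differentiation with Autograd).
So simply use the Variable.data attribute.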
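
As a rough illustration of this fix, the sketch below copies a graph-tracked tensor into a plain buffer. It assumes PyTorch 0.4 or later, where Variable and Tensor were merged and .detach() plays the role that .data plays above; the shapes are made up for the example.

```python
import torch

# Hypothetical shapes, chosen only for illustration.
new_output = torch.zeros(2, 3, 4, 5, 5)                       # plain buffer, not tracked by autograd
temp2d_output = torch.randn(2, 3, 5, 5, requires_grad=True)   # stands in for a conv output

# .detach() (or .data in the old API) returns the underlying values
# without the autograd history, so the copy into the buffer is a plain copy.
new_output[:, :, 0, :, :] = temp2d_output.detach()

print(new_output.requires_grad)
```

Keep in mind that either attribute cuts the gradient path: nothing that is computed from new_output afterwards will backpropagate into the layer that produced temp2d_output.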

2. Mismatched tensor types in PyTorch computations

BUG:
Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #2 'weight'
Full traceback:

Traceback (most recent call last):
  File "p3d_model.py", line 425, in <module>
    out=model(data)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "p3d_model.py", line 299, in forward
    x = self.maxpool_2(self.layer1(x))  #  Part Res2
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "p3d_model.py", line 166, in forward
    out=self.ST_A(out)
  File "p3d_model.py", line 120, in ST_A
    x = self.bn2(x)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/modules/batchnorm.py", line 37, in forward
    self.training, self.momentum, self.eps)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/nn/functional.py", line 1013, in batch_norm
    return f(input, weight, bias)
RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #2 'weight'

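Read literally, the message says that a torch.FloatTensor was expected, but argument #2 ('weight') was actually a torch.cuda.FloatTensor.

How I went wrong
The failing statement is self.bn2(x).
At first I kept assuming that the type of the x I passed in was the problem: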

    def conv2_fyq(self,x):
        deep=x.shape[2]
        temp2D_output=self.conv2(x[:,:,0,:,:])
        # torch.Tensor(...) allocates a tensor of the default type (a CPU torch.FloatTensor unless changed)
        new_output=torch.Tensor(temp2D_output.shape[0],temp2D_output.shape[1],deep,temp2D_output.shape[2],temp2D_output.shape[3])
        for i in range(deep):
            temp2D_input=x[:,:,i,:,:]
            temp2D_output=self.conv2(temp2D_input)
            print (temp2D_output.shape)      # (10, ,160,160)
            new_output[:,:,i,:,:]=temp2D_output.data
        print (new_output.shape)                  # (10, ,16,160,160)
        # print (new_output)
        result=new_output.type(torch.FloatTensor)
        # print (result)
        result=Variable(result)
        return result

    x = self.conv2_fyq(x)
    x = self.bn2(x)                    # error

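So I kept changing the return value of conv2_fyq(), hoping to convert it from torch.cuda.FloatTensor to torch.FloatTensor. I tried many things, for example:

result=result.cpu()
going through a numpy array as an intermediate type
an explicit type conversion: result=new_output.type(torch.FloatTensor)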

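Solution
The error is certainly caused by mismatched tensor types.
After a lot of reading, I found that the root cause is that the two tensors do not live in the same memory space, for example one on the CPU and the other on the GPU.
Printing result showed that it was a torch.FloatTensor, which made me realize that the mismatch came from the other side: the parameters of nn.BatchNorm3d are torch.cuda.FloatTensor.
So the final fix is to convert result to torch.cuda.FloatTensor:
result=new_output.type(torch.cuda.FloatTensor)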
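
One way to avoid this class of mismatch altogether is to never allocate the output buffer by hand. The sketch below is not the original conv2_fyq; it swaps in torch.stack, which keeps the device and dtype of its inputs and also keeps the autograd graph intact. It assumes PyTorch 0.4 or later; the layer sizes and the name conv2_per_frame are hypothetical.

```python
import torch
import torch.nn as nn

# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

conv2 = nn.Conv2d(3, 4, kernel_size=3, padding=1).to(device)  # hypothetical layer sizes
bn2 = nn.BatchNorm3d(4).to(device)

def conv2_per_frame(x):
    """Apply a 2D convolution to every depth slice of a (N, C, D, H, W) tensor."""
    frames = [conv2(x[:, :, i, :, :]) for i in range(x.shape[2])]
    # torch.stack returns a tensor on the same device as its inputs,
    # so the result matches the device of the layers that consume it.
    return torch.stack(frames, dim=2)

x = torch.randn(2, 3, 5, 8, 8, device=device)
out = bn2(conv2_per_frame(x))
print(out.shape, out.device)
```

Because no .data copy is involved, gradients can still flow back into conv2 during training.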
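References
torch.Tensor类型的构建与相互转换 (Constructing torch.Tensor types and converting between them)
expected CPU tensor (got CUDA tensor)
PyTorch遇到令人迷人的BUG与记录 (A record of baffling bugs encountered in PyTorch)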

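Fixing this one small bug took nearly 2 hours.
I did not find the answer directly in the references, but they still gave me the key hints.
Solving a problem that has no ready-made answer is quite fun, hahaha. Honestly, though, saying I enjoyed it would be a lie; the process is the painful part.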

3. Valid label values for a dataset

BUG:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512378360668/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=301 error=59 : device-side assert triggered
Full output:

hl@hl-Precision-Tower-5810:~/Desktop/lovelyqian/CV_Learning/pseudo-3d-conv_S$ python P3D_train_fyq.py
hello
[1,1] loss: 4.593
[1,2] loss: 5.070
[1,3] loss: 4.854
[1,4] loss: 4.764
[1,5] loss: 4.807
[1,6] loss: 4.664
[1,7] loss: 4.797
[1,8] loss: 4.802
[1,9] loss: 4.808
[1,10] loss: 4.564
[1,11] loss: 4.509
[1,12] loss: 5.150
[1,13] loss: 4.323
[1,14] loss: 4.779
[1,15] loss: 5.295
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512378360668/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=301 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "P3D_train_fyq.py", line 43, in <module>
    loss.backward()
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512378360668/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:301

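Analysis:
This error has little to do with my own source code; the traceback shows directly that it was raised during PyTorch's backward pass.
Still, here is the source code anyway: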

    # imports used by this snippet (UCF101 and P3D199 come from the project's own modules)
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.autograd import Variable

    # dataset
    myUCF101=UCF101()
    classNames=myUCF101.get_className()
    # print (classNames)

    # model
    model = P3D199(pretrained=False,num_classes=101)
    model = model.cuda()
    # print (model)

    # loss and optimizer
    criterion=nn.CrossEntropyLoss()
    optimizer=optim.SGD(model.parameters(),lr=0.001)

    # train the network
    for epoch in range(2):         # loop over the dataset multiple times
        running_loss=0
        batch_num=myUCF101.set_mode('train')
        for batch_index in range(batch_num):
            # get the training data
            train_x,train_y=myUCF101[batch_index]
            # wrap them in Variable
            # train_x,train_y=Variable(train_x.cuda()),Variable(train_y.type(torch.LongTensor).cuda())
            train_x,train_y=Variable(train_x).cuda(),Variable(train_y.type(torch.LongTensor)).cuda()
            # zero the gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            out=model(train_x)
            loss=criterion(out,train_y)
            loss.backward()
            optimizer.step()
            # print statistics (f is a log file opened elsewhere)
            running_loss+=loss.data[0]
            print ('[%d,%d] loss: %.3f' %(epoch+1,batch_index+1,running_loss))
            print >> f, ('[%d,%d] loss: %.3f' %(epoch+1,batch_index+1,running_loss))
            running_loss=0.0

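The logic of the code is simple. It is the usual training loop: feed data to the network, zero the gradients, compute the output, compute the loss, then backpropagate and update the parameters.

I could not find an identical case, but the references listed below pointed me to the idea that the problem is related to the labels.
I also noticed that the error did not appear at the same point in every run: the number of batches that trained successfully varied, sometimes more, sometimes fewer.
That led me to print train_y. After several runs it turned out that every failure happened when the label 101 appeared in the batch, which brought me back to the question of label values.
This project uses the UCF101 dataset, which has 101 classes, so when reading the labels I simply used the ones provided by the dataset authors, which range over 1-101. But nn.CrossEntropyLoss expects class indices from 0 to num_classes-1, that is, 0-100 here.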

Solution:
With 101 class names, the valid label range is 0-100.
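
A minimal sketch of the remapping, with a range check that runs on the CPU before anything reaches the GPU. The label values are made up, and the code assumes a reasonably recent PyTorch where torch.tensor is available.

```python
import torch
import torch.nn as nn

num_classes = 101
raw_labels = torch.tensor([1, 57, 101])   # hypothetical labels as shipped with the dataset (1..101)
labels = raw_labels - 1                    # shift into the range 0..100

# A cheap sanity check gives a readable error instead of a device-side assert.
assert labels.min().item() >= 0 and labels.max().item() < num_classes

logits = torch.randn(3, num_classes)       # stands in for the model output
loss = nn.CrossEntropyLoss()(logits, labels)
print(loss.item())
```

When a device-side assert does show up, re-running the failing step on the CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1 set, usually produces a message that points much closer to the real cause.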
References:
pytorch 问题汇总 (A collection of PyTorch problems)
Pytorch图像分割BUG心得汇总（一） (Lessons from PyTorch image segmentation bugs (1))

4. TypeError: only integer scalar arrays can be converted to a scalar index

BUG:
TypeError: only integer scalar arrays can be converted to a scalar index
Context:
Here is the relevant source code:

def test(dataset,model,model_state_path):
    myUCF101=dataset
    model.load_state_dict(torch.load(model_state_path))
    classNames=myUCF101.get_className()
    # test the network on the test data
    batch_num=myUCF101.set_mode('test')
    for batch_index in range(batch_num):
        batch_correct=0
        # get the test data
        test_x,test_y_label=myUCF101[batch_index]
        # wrap test_x in Variable
        test_x=Variable(test_x.cuda())
        # get the predicted output
        out=model(test_x)
        _,predicted_y=torch.max(out.data,1)
        predicted_label=classNames[predicted_y]      # this line raises the TypeError
        batch_correct+= (predicted_label==test_y_label).sum()
        # f is a log file opened elsewhere
        print('batch: %d  accuracy is: %.2f' %(batch_index+1,batch_correct/float(len(test_y_label))))
        print >> f, ('batch: %d  accuracy is: %.2f' %(batch_index+1,batch_correct/float(len(test_y_label))))
    print ('Test Finished')

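The lines that matter are these:

  out=model(test_x)
  _,predicted_y=torch.max(out.data,1)
  predicted_label=classNames[predicted_y]

out holds the classification scores, predicted_y holds the index of the most likely label for each sample, and classNames holds the actual label names.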

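Error analysis:

1. First I converted predicted_y from a CUDA LongTensor to NumPy format.

predicted_y=predicted_y.cpu().numpy()

2. That still did not work, so I printed predicted_y and saw that it was an np.ndarray. I guessed that it might need to go through np.array(), for example:

predicted_y=np.array(predicted_y,dtype=np.uint8)

This still did not solve the problem, and many of the fixes suggested online, such as predicted_y=predicted_y.flatten() to flatten a multi-dimensional array into one dimension, did not work either.

3. Since the message seemed to call for an np.array index, I defined a=np.arange(8) and tested classNames[a], which raised the same error. At that point it was clearly no longer a problem with the index, which finally made me suspect classNames itself.
Then I realized that classNames is not an array but a list, so I converted classNames from a list to an array.

 classNames=np.array(classNames)

That solved the problem.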

Solution:

  # cuda.LongTensor to numpy.array
  predicted_y=predicted_y.cpu().numpy()
  # ndarray to array (with an explicit integer dtype)
  predicted_y=np.array(predicted_y,dtype=np.uint8)
  # list to array
  classNames=np.array(classNames)

In summary, TypeError: only integer scalar arrays can be converted to a scalar index can be approached from two sides, the index and the object being indexed: both need to be of type np.array.
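
The behaviour can be reproduced without any model. In the sketch below the class names and indices are made up. It also suggests that the list-to-array conversion is the step that matters: np.array() applied to an ndarray simply returns another ndarray, so once classNames is an array any integer index array works.

```python
import numpy as np

class_names = ["run", "jump", "swim", "dive"]   # hypothetical class names (a plain Python list)
predicted_y = np.array([2, 0, 3])                # hypothetical predicted indices

message = None
try:
    class_names[predicted_y]                     # a list cannot be indexed by a multi-element array
except TypeError as err:
    message = str(err)
print(message)

# Converting the list to an array enables "fancy" indexing with an index array.
names_arr = np.array(class_names)
predicted_label = names_arr[predicted_y]
print(predicted_label)

# Element-wise comparison against the ground truth gives the batch accuracy.
test_y_label = np.array(["swim", "run", "jump"])
accuracy = (predicted_label == test_y_label).mean()
print(accuracy)
```

If converting classNames is not an option, a list comprehension such as [class_names[i] for i in predicted_y] sidesteps the error as well, at the cost of returning a list that cannot be compared element-wise with ==.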

5. Data problems when a DataLoader processes the dataset

BUG:
RuntimeError: invalid argument 2: cannot unsqueeze empty tensor at /opt/conda/conda-bld/pytorch_1512378360668/work/torch/lib/TH/generic/THTensor.c:601
Full traceback:

Traceback (most recent call last):
  File "UCF101_pytorch_fyq.py", line 182, in <module>
    for i_batch,sample_batched in enumerate(dataloader):
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 210, in __next__
    return self._process_next_batch(batch)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 230, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 42, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 116, in default_collate
    return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 116, in <dictcomp>
    return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 96, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/home/hl/anaconda2/lib/python2.7/site-packages/torch/functional.py", line 62, in stack
    inputs = [t.unsqueeze(dim) for t in sequence]
RuntimeError: invalid argument 2: cannot unsqueeze empty tensor at /opt/conda/conda-bld/pytorch_1512378360668/work/torch/lib/TH/generic/THTensor.c:601

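Error analysis:
It is fairly clear that the problem lies in the DataLoader, but when using a framework it is hard to pin it down to a specific function or line of code. The last line of the traceback shows that an empty Tensor is the cause.

Without the DataLoader, we can print a few samples directly:

print(len(myUCF101))
for i in range(5):
    sample=myUCF101[i]
    print(sample['video_x'].size(),sample['video_label'])

This gives the following output:
[Figure: printed sizes of video_x and the corresponding video_label values]
The output shows that the problem is related to video_label.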

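Solution:
Many thanks to this post for pointing the way: Cannot Unsqueeze Empty Tensor

The error was indeed caused by video_label being a scalar. I ended up changing the code as follows:
Before (wrong):
[Figure: original code]
After:
[Figure: modified code]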
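
The original before and after code is not available here, so the following is only a sketch of one plausible shape of the fix, not the author's actual change. TinyVideoDataset is a hypothetical stand-in for the UCF101 dataset class, and the code assumes a recent PyTorch. One well-known way to end up with an empty label tensor is torch.LongTensor(n), which allocates a tensor of length n rather than a tensor holding the value n, so label 0 yields an empty tensor.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TinyVideoDataset(Dataset):
    """Hypothetical stand-in for the UCF101 dataset class."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        video_x = torch.zeros(3, 2, 4, 4)
        # Wrap the scalar label in a list so the tensor always holds exactly one element.
        # torch.LongTensor(idx) would instead allocate a tensor of length idx,
        # which is empty when idx == 0.
        video_label = torch.tensor([idx], dtype=torch.long)
        return {'video_x': video_x, 'video_label': video_label}

loader = DataLoader(TinyVideoDataset(), batch_size=2, shuffle=False)
batch = next(iter(loader))
print(batch['video_x'].shape, batch['video_label'].shape)
```

With one-element label tensors, the default collate function can stack every field of the sample dictionary along a new batch dimension.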
