The FaceBoxes paper: https://arxiv.org/pdf/1708.05234.pdf
Code:
Caffe version: https://github.com/sfzhang15/FaceBoxes
PyTorch version: https://github.com/zisianw/FaceBoxes.PyTorch
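I am on Ubuntu. I downloaded the PyTorch version and, following the README, ran make.sh. That exposed a problem: /usr/bin/nvcc exists on my machine, so the script picked it up first and concluded that lib64 was at /usr/lib64. I changed line 43 of utils/build.py.
Original: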
nvcc = find_in_path('nvcc', os.environ['PATH'] + os.pathsep + default_path)
Changed to:
nvcc = find_in_path('nvcc', default_path + os.pathsep + os.environ['PATH'])
The default path, /usr/local/cuda/bin, should be checked first.
(Another fix is to delete /usr/bin/nvcc and add /usr/local/cuda/bin to the $PATH environment variable.)
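To see why the order of the two path strings matters, here is a minimal sketch. The find_in_path below is my own stand-in, written on the assumption that the helper in utils/build.py returns the first directory that contains the file; the two temporary directories are hypothetical substitutes for /usr/bin and /usr/local/cuda/bin.

```python
import os
import tempfile


def find_in_path(name, path):
    """Return the first match for `name` in an os.pathsep-separated search path.

    Assumed behaviour of the helper in utils/build.py, not a copy of it.
    """
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return os.path.abspath(candidate)
    return None


def demo():
    # Two throwaway directories stand in for /usr/bin and /usr/local/cuda/bin.
    with tempfile.TemporaryDirectory() as system_bin, \
            tempfile.TemporaryDirectory() as cuda_bin:
        for directory in (system_bin, cuda_bin):
            open(os.path.join(directory, "nvcc"), "w").close()
        env_path = system_bin      # plays the role of os.environ['PATH']
        default_path = cuda_bin    # plays the role of default_path
        before = find_in_path("nvcc", env_path + os.pathsep + default_path)
        after = find_in_path("nvcc", default_path + os.pathsep + env_path)
        # True where the expected directory won the lookup.
        return (os.path.dirname(before) == os.path.abspath(system_bin),
                os.path.dirname(after) == os.path.abspath(cuda_bin))
```

With PATH first, the copy in the "system" directory wins; with default_path first, the CUDA copy wins, which is what the build needs in order to locate lib64 next to it.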
I then ran make.sh again and got this error:
...
nms/cpu_nms.c:8966:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
tstate->exc_traceback = local_tb;
^~~~~~~~~~~~~
curexc_traceback
error: command 'gcc' failed with exit status 1
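My fix:
Delete the following files. They are generated when make.sh runs, so the next run recreates them. (The copies I had were presumably produced by an older Cython that does not know the PyThreadState layout of Python 3.7.)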
utils/nms/cpu_nms.c
utils/nms/gpu_nms.cpp
That fixed it.
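The cleanup step can also be scripted. This is a small sketch, assuming it is run from the repository root; remove_generated and GENERATED are names I made up for illustration, and only the two file paths come from the steps above.

```python
import os

# The two generated files named above, relative to the repository root.
GENERATED = ["utils/nms/cpu_nms.c", "utils/nms/gpu_nms.cpp"]


def remove_generated(paths):
    """Delete each path that exists as a file and return the ones removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Calling remove_generated(GENERATED) before make.sh forces both files to be regenerated by whatever Cython version is installed locally.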
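Download the WIDER FACE dataset (I got it from Baidu Netdisk). Unzip WIDER_train.zip under data/WIDER_FACE, then move the extracted images directory up into data/WIDER_FACE, because train.py looks for training images at data/WIDER_FACE/images by default.
Download the annotation data provided by FaceBoxes: annotations.tar
Extract it to data/WIDER_FACE/annotations. It contains a pile of XML files that give the coordinates of the faces in the training set.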
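Before starting a long training run it is worth checking that both directories ended up where expected. A minimal sketch; missing_dirs is a hypothetical helper of mine, and only the directory names are taken from the steps above.

```python
import os


def missing_dirs(root="data/WIDER_FACE", required=("images", "annotations")):
    """Return the required sub-directories that do not exist under `root`."""
    return [name for name in required
            if not os.path.isdir(os.path.join(root, name))]
```

Run from the repository root, missing_dirs() should return an empty list once the images and annotations are in place.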
Run python train.py to start training.
It failed right away with this error:
...
Traceback (most recent call last):
File "train.py", line 69, in <module>
net = torch.nn.DataParallel(net, device_ids=list(range(num_gpu)))
File "/home/howard/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
_check_balance(self.device_ids)
File "/home/howard/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/home/howard/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/home/howard/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 318, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
The cause: the default argument asks for 2 GPUs, but my machine has only one. So I ran: python train.py --ngpu 1
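One way to avoid the "Invalid device id" assertion is to clamp the requested GPU count to what the machine actually has. A sketch only: usable_device_ids is a hypothetical helper, not something in train.py, and in a real script the `available` argument would come from torch.cuda.device_count().

```python
def usable_device_ids(requested, available):
    """Return device ids 0..n-1, where n is capped by the available GPU count.

    `available` is passed in explicitly so the sketch runs without PyTorch;
    in practice it would be torch.cuda.device_count().
    """
    count = max(0, min(requested, available))
    return list(range(count))
```

With the default of 2 requested GPUs and 1 available, this yields [0], which is the same device list that --ngpu 1 produces.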
Before long, another error appeared:
Traceback (most recent call last):
File "train.py", line 152, in <module>
train()
File "train.py", line 117, in train
images = images.to(device)
File "/home/howard/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 29755) is killed by signal: Bus error.
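This looks like the machine running out of memory (a Bus error in a DataLoader worker usually points at exhausted shared memory, which the workers use to hand batches back). I found that num_workers defaults to 8, while my machine has a 4-core CPU and only 8 GB of RAM. The number of worker processes should not exceed the number of CPU cores anyway, otherwise context switching may slow the whole run down. So I tried 4 workers: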
python train.py --ngpu 1 --num_workers 4
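The rule of thumb used here (no more workers than CPU cores) is easy to encode. A minimal sketch; pick_num_workers is a name I made up, and it does nothing about RAM or shared memory, which still have to be watched separately.

```python
import os


def pick_num_workers(requested, cpu_count=None):
    """Cap the requested DataLoader worker count at the number of CPU cores."""
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    return max(0, min(requested, cpu_count))
```

On my 4-core machine, pick_num_workers(8) gives 4, matching the value passed on the command line above.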
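Ha, it survived, and training started.
Clearly this is not going to finish in a day or two...
One thing to do at this point: create weights/, the directory where the trained model is saved. During training a checkpoint is written every 10 epochs.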
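Creating the directory and the every-10-epochs rule can be sketched as follows. Both helpers are hypothetical illustrations, not code from train.py, and whether the real script counts epochs from 0 or 1 is an assumption I have not checked.

```python
import os


def prepare_save_dir(path="weights"):
    """Create the checkpoint directory if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


def should_checkpoint(epoch, every=10):
    """True on epochs 10, 20, 30, ... (assumes epochs are counted from 1)."""
    return epoch > 0 and epoch % every == 0
```

Calling prepare_save_dir() once before training avoids losing the first checkpoint to a missing directory.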
Alternatively, download the final pre-trained model and put it in weights/.
So next, it is back to reading the code.