While training one of my deep learning models on the GPU, I ran into the following output:
...
2018-06-27 18:09:11.701458: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 63521 get requests, put_count=63521 evicted_count=1000 eviction_rate=0.0157428 and unsatisfied allocation rate=0.0173171
2018-06-27 18:09:11.701503: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Global_step 2000 Train_loss: 0.0758
Global_step 3000 Train_loss: 0.0618
Global_step 4000 Train_loss: 0.0564
Global_step 5000 Train_loss: 0.0521
Global_step 6000 Train_loss: 0.0492
Global_step 7000 Train_loss: 0.0468
Global_step 8000 Train_loss: 0.0443
Global_step 9000 Train_loss: 0.0422
Global_step 10000 Train_loss: 0.0410
Global_step 11000 Train_loss: 0.0397
Global_step 12000 Train_loss: 0.0383
2018-06-27 18:13:59.743133: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 71532 get requests, put_count=71532 evicted_count=1000 eviction_rate=0.0139798 and unsatisfied allocation rate=0.0143013
2018-06-27 18:13:59.743167: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
...
Besides the usual loss values, I noticed warning-like messages interleaved with the training log. Although the final results were completely fine, I got curious and found the explanation on Stack Overflow:
TensorFlow has multiple memory allocators, for memory that will be used in different ways. Their behavior has some adaptive aspects.
In your particular case, since you're using a GPU, there is a PoolAllocator for CPU memory that is pre-registered with the GPU for fast DMA. A tensor that is expected to be transferred from CPU to GPU, e.g., will be allocated from this pool.
The PoolAllocators attempt to amortize the cost of calling a more expensive underlying allocator by keeping around a pool of allocated then freed chunks that are eligible for immediate reuse. Their default behavior is to grow slowly until the eviction rate drops below some constant. (The eviction rate is the proportion of free calls where we return an unused chunk from the pool to the underlying pool in order not to exceed the size limit.) In the log messages above, you see "Raising pool_size_limit_" lines that show the pool size growing. Assuming that your program actually has a steady state behavior with a maximum size collection of chunks it needs, the pool will grow to accommodate it, and then grow no more. It behaves this way rather than simply retaining all chunks ever allocated so that sizes needed only rarely, or only during program startup, are less likely to be retained in the pool.
These messages should only be a cause for concern if you run out of memory. In such a case the log messages may help diagnose the problem. Note also that peak execution speed may only be attained after the memory pools have grown to the proper size.
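To make the growth heuristic a bit more concrete, here is a toy Python sketch of the idea described in the answer. It is not TensorFlow's actual C++ PoolAllocator; the 10% growth factor, the chunk size, and the eviction threshold below are made-up values chosen only so the output resembles the "Raising pool_size_limit_" lines above.

import collections

class ToyPoolAllocator:
    """Toy illustration of an adaptive pool of reusable memory chunks."""

    def __init__(self, size_limit=100, eviction_threshold=0.01):
        self.pool = collections.deque()   # freed chunks eligible for immediate reuse
        self.size_limit = size_limit
        self.eviction_threshold = eviction_threshold
        self.get_count = 0                # mirrors the "get requests" counter in the log
        self.put_count = 0
        self.evicted_count = 0

    def get(self):
        """Hand out a chunk, reusing a pooled one when possible."""
        self.get_count += 1
        if self.pool:
            return self.pool.popleft()
        return bytearray(1024)            # fall back to the "expensive" underlying allocator

    def put(self, chunk):
        """Return a chunk to the pool, evicting it if the pool is already full."""
        self.put_count += 1
        if len(self.pool) >= self.size_limit:
            self.evicted_count += 1       # chunk goes back to the underlying allocator
            self._maybe_grow()
            return
        self.pool.append(chunk)

    def _maybe_grow(self):
        """Grow slowly while the eviction rate stays above the threshold."""
        eviction_rate = self.evicted_count / max(self.put_count, 1)
        if eviction_rate > self.eviction_threshold:
            new_limit = int(self.size_limit * 1.1)   # e.g. 100 -> 110, 256 -> 281
            print("Raising pool_size_limit_ from %d to %d" % (self.size_limit, new_limit))
            self.size_limit = new_limit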
The quoted answer explains the mechanism, how it is handled, and why. To sum up: the PoolAllocator is a memory allocation mechanism; CPU and GPU memory are not isolated, and data is transferred between them. If you use more space than the preset pool size, the allocator simply raises the limit; once the pool is large enough, there is no further impact. That said, if you actually run out of memory, you are probably loading too much data at once. Consider shrinking the batch size so that less data is loaded per step, or kill your neighbor's job on the shared GPU.
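If you do hit out-of-memory errors, these are the two knobs the summary points at. This is only a minimal TF 1.x sketch (matching the 2018-era logs above), not code from the original training script; BATCH_SIZE is just an illustrative constant for your own input pipeline.

import tensorflow as tf

# Load less data per step: a smaller batch means smaller host-side buffers
# staged for CPU->GPU transfer (the memory the PoolAllocator manages).
BATCH_SIZE = 32  # e.g. try halving your current value

# Optionally keep the GPU allocator from grabbing all device memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Or hard-cap the fraction of GPU memory this process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

sess = tf.Session(config=config)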
Feel free to follow my personal blog, and check out more code on my personal GitHub. If you have any questions about algorithms or code, you are welcome to message me via my WeChat official account.