Notes on installing TensorFlow GPU in a container with nvidia-docker

Published: 2023-11-23 01:27:14

1. Install nvidia-docker; see https://github.com/NVIDIA/nvidia-docker/ for details.

2. Once it is installed, test that CUDA is usable:

docker run --gpus all nvidia/cuda:10.0-base-centos7 nvidia-smi

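3. If it works, you will see the output of the nvidia-smi command. Then start on your own container: first create one to use like a virtual machine. Here I chose the nvidia/cuda:10.0-base-centos7 image.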

docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck3' nvidia/cuda:10.0-base-centos7 /bin/bash

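3.1. Later I found that nvidia/cuda:10.0-base-centos7 is probably too minimal a base image. When TensorFlow in Python finally tries to set up the GPU, it reports errors like: Could not dlopen library 'libcublas.so.10.0'. So I tried a more complete base image: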

docker run -tdi --gpus all -v /data/projects:/run/projects --name='andp_buck4' nvidia/cuda:10.1-cudnn7-devel-centos7 /bin/bash

nvidia/cuda:10.1-cudnn7-devel-centos7 is a very large image; with it, continue through the same kind of process described below.
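
The dlopen error in step 3.1 suggests that the base image lacks some CUDA libraries that TensorFlow loads at runtime. A minimal sketch for checking this inside a container, assuming the libraries live somewhere under /usr/local/cuda (an assumption about the image layout, not something stated above). The function name missing_libs and the library names passed to it are illustrative only:

```shell
# Print every requested library file name that cannot be found under a root directory.
# $1 = directory to search, remaining arguments = library file names.
missing_libs() {
    root=$1
    shift
    for lib in "$@"; do
        # find -L follows symlinks, in case the root itself is a symlink
        if [ -z "$(find -L "$root" -name "$lib" 2>/dev/null | head -n 1)" ]; then
            printf '%s\n' "$lib"
        fi
    done
}

# Hypothetical usage inside the container (path and names are assumptions):
# missing_libs /usr/local/cuda libcublas.so.10.0 libcudart.so.10.0
```

Empty output means every listed file was found under the given directory; any name printed is a candidate for the "Could not dlopen library" error.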

4. Enter the container just created (nvidia-smi works correctly inside the container):

docker exec -it andp_buck3 /bin/bash

5. Install the Python environment, following my previous blog post. I assume the required files are already available, in my tools directory:

cd /run/projects/tools/
cd openssl-1.1.1
./config --prefix=/usr/local/openssl shared zlib
make && make install
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/openssl/lib' >> $HOME/.bash_profile
source $HOME/.bash_profile
openssl version
cd ../Python-3.7.4
./configure --prefix=/usr/local/python374 --enable-optimizations --enable-shared --with-openssl=/usr/local/openssl
make && make install
ln -s  /usr/local/python374/bin/pip3 /usr/bin/
ln -s /usr/local/python374/bin/python3 /usr/bin/

6. Try installing tensorflow-gpu:

pip3 install tensorflow-gpu==1.14 -i https://mirrors.aliyun.com/pypi/simple/

7. The installation completes without problems, but import tensorflow still fails with:

ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found

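So I went through the relevant steps from the later part of my previous blog post again; here is a more streamlined summary of them:

8. Since gcc 9.2 had already been built outside the container, first try linking directly to the .so built outside the container: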

ln /run/projects/tools/glibc-2.30/build/math/libm.so.6 /lib64/libm.so.6  -s

After that, import tensorflow reports a different error:      /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20'
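
Both errors above say that a system library does not carry a symbol version that TensorFlow was built against. A rough way to see which version tags a given .so contains is to scan it for the tag strings, much like the strings | grep checks used later in this post. This is only a sketch: the function names symbol_versions and provides_version are made up for illustration, and a plain string scan cannot tell versions a library defines from versions it merely requires from another library.

```shell
# List the distinct version tags with a given prefix (for example GLIBC_ or GLIBCXX_)
# that appear as plain strings inside a file.
# $1 = prefix, $2 = path to the library
symbol_versions() {
    grep -a -o "$1[0-9][0-9.]*" "$2" | sort -u
}

# Succeed only if the file contains exactly the given tag, e.g. GLIBC_2.23.
# $1 = path to the library, $2 = full tag
provides_version() {
    prefix=${2%%_*}_
    symbol_versions "$prefix" "$1" | grep -q -x -F "$2"
}

# Hypothetical usage inside the container:
# provides_version /lib64/libm.so.6 GLIBC_2.23 && echo present || echo missing
```

Matching the whole tag (grep -x) matters here, because a shorter tag such as GLIBC_2.2 would otherwise match inside GLIBC_2.23.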

9. gcc 9.2.0 had only been installed outside the container, so the gcc 9.2.0 installation has to be repeated inside the container:

cd /make-4.2.1
./configure --prefix=$HOME/local
make -v
make && make install
/root/local/bin/make -v
mv /usr/bin/make /usr/bin/make3
ln -s /root/local/bin/make /usr/bin/make
make -v
gmake
gmake -v
cd ../gcc-9.2.0
./contrib/download_prerequisites
mkdir build 
cd build/
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib
# configure failed with a missing-dependency error:
# configure: error: Building GCC requires GMP 4.2+, MPFR 2.4.0+ and MPC 0.8.0+.
yum install wget bzip2 gcc gcc-c++ glibc-headers    # may not be required
yum install autoconf    # may not be required
yum install gmp
yum install mpfr
yum install libmpc-devel bison
../configure --prefix=/usr/local/gcc-9.2.0 --enable-bootstrap --enable-checking=release --enable-languages=c,c++ --disable-multilib    # this time it succeeded
make && make install    # takes a long time
gcc -v
echo -e '\nexport PATH=/usr/local/gcc-9.2.0/bin:$PATH\n' >> ~/.bash_profile 
source ~/.bash_profile 
gcc -v
ln -sv /usr/local/gcc-9.2.0/include/ /usr/include/gcc
ldconfig -v
ldconfig -p |grep gcc    # print to verify
gcc -v
cd ../../glibc-2.30/build/
LD_LIBRARY_PATH='' 
../configure  --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make 
make install
sudo find / -name 'glibc*'
strings  math/libm.so.6 | grep GLIBC_2.23
mv /lib64/libm.so.6 /lib64/libm.so.6.old
cp math/libm.so.6 /lib64/libm.so.6
find / -name 'libstdc++.so.6*'
strings /usr/lib64/libstdc++.so.6.0.19 | grep CXXABI_1.3
strings /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 | grep CXXABI_1.3
mv /usr/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6.old1
ln -s /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
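
The last few commands above replace a system library path with a symlink to the newly built library. A small helper along these lines keeps a backup and refuses to proceed if the replacement is missing. This is a sketch under my own assumptions, not what the steps above actually ran; the name replace_with_symlink and the .bak suffix are made up for illustration:

```shell
# Back up an existing library path, then point that path at a replacement via a symlink.
# $1 = replacement library, $2 = path to replace
replace_with_symlink() {
    new=$1
    target=$2
    # Refuse to touch the target if the replacement does not exist
    [ -e "$new" ] || { echo "replacement not found: $new" >&2; return 1; }
    # Keep a copy of the current file (or of the symlink itself, thanks to -P)
    if [ -e "$target" ] || [ -L "$target" ]; then
        cp -P "$target" "$target.bak" || return 1
    fi
    # -f replaces the existing entry, -n avoids descending into a symlinked directory
    ln -sfn "$new" "$target"
}

# Hypothetical usage with the paths from this step:
# replace_with_symlink /usr/local/gcc-9.2.0/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
```

If import tensorflow breaks after the swap, the .bak copy next to the target can be moved back into place.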

10. Testing shows that import tensorflow now works. After that, install a few more packages to round out the environment:

pip3 install pandas ipython sqlalchemy pymysql psycopg2-binary pyhive scipy numpy  -i https://mirrors.aliyun.com/pypi/simple/

Now you can happily train TensorFlow projects on the GPU inside the container.

11. After confirming that a training project runs, commit the container and push the image:

[root@localhost ~]# docker commit -m 'for tensorflow-gpu-py374' -a='antony314' 3ff2d3cfa0ba antony314/centos76:v2.2
sha256:21f3b71f9939226f1d817c19ed88f14fa0c2ff5e76eed7b5b17b9fa9463801cf
[root@localhost ~]# docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
antony314/centos76      v2.2                21f3b71f9939        2 minutes ago       4.29GB
antony314/centos76      v2.1                52427f8da2c5        2 months ago        1.84GB
antony314/centos76      v2                  c65e5d82b7d4        2 months ago        1.84GB
antony314/centos76      v1                  241bcf6311b7        2 months ago        611MB
tensorflow/tensorflow   latest              d64a95598d6c        2 months ago        1.03GB
nvidia/cuda             10.0-base-centos7   e9f670f1d5b9        3 months ago        254MB
nvidia/cuda             9.0-base            1443caa429f9        3 months ago        137MB
nvidia/cuda             10.0-base           5026b20f9c3d        3 months ago        110MB
antony314/centos76      7.6init             2cf0fa81ce78        4 months ago        202MB
[root@localhost ~]# docker push antony314/centos76:v2.2
The push refers to repository [docker.io/antony314/centos76]
711e037a5568: Pushed 
74f64c7f6830: Mounted from nvidia/cuda 
ccbc602e5359: Mounted from nvidia/cuda 
a71b7655dacc: Mounted from nvidia/cuda 
5d01beb4238f: Mounted from nvidia/cuda 
877b494a9f30: Mounted from nvidia/cuda 
v2.2: digest: sha256:a742553910d749b1d1a2ab22d85e2f0145af301c6dbca4b89becf1c3b6266129 size: 1577

 

Finally, one long-standing headache is left open for now: the container keeps getting bigger.
