debug glance (by quqi99)

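Author: Zhang Hua  Published: 2021-05-19
Copyright notice: This article may be freely reposted, provided the repost includes a hyperlink to the original source, the author information, and this copyright notice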

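When both glance and nova use rbd, a raw image lets them take advantage of Ceph's CoW, so the image does not have to be downloaded from Ceph to local disk, converted, and then uploaded back to Ceph. See https://swamireddy.wordpress.com/2016/04/08/glance-s-quick-uploaddownload-with-ceph/ . Note: this only applies when a Ceph image is used as the rootfs.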

The image download failed (checked with md5sum xx.img), so the VM could not be created.
1, Rule out a Ceph problem

rbd info --pool=glance <img-id>
date; time rbd export --pool=glance <img-id> img;

2, Rule out a hacluster problem. 10.5.100.0 is the VIP on hacluster and 10.134.154.206 is one of the glance units. Use --os-image-url to choose which URL is used

glance --os-image-url http://10.5.100.0:9292 image-download --file img --progress <img-id>
time glance --insecure --debug --os-image-url https://10.134.154.206:9282 image-download --file img-glance1 --progress <img-id> 2> glance_debug_9282_with_https.log

3, Rule out an apache2 problem. 9282 is the apache2 port and 9272 is the glance port

time glance --insecure --debug --os-image-url http://10.134.154.206:9272 image-download --file img-glance1 --progress <img-id> 2> glance_debug_9272_without_https.log

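4, At this point the problem appeared to be related to glance only, yet nothing notable showed up in the logs after enabling the glance debug log and adding '--debug' to the glance CLI.
5, qcow2 is not recommended with Ceph, because a qcow2 image has to be converted to raw first, which can cause a timeout. Test a small qcow2 img (cirros)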

wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
openstack image create --public --disk-format qcow2 --file cirros-0.3.4-x86_64-disk.img cirros
glance --os-image-url=http://10.134.154.206:9272 image-download $(openstack image show cirros -c id -f value) --file img --progress
md5sum img

This timeout can be controlled with the following parameters:

juju config nova-compute-kvm config-flags='block_device_allocate_retries=60,block_device_allocate_retries_interval=10'
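
To get a feel for how long these two settings allow, the sketch below multiplies the retry count by the interval. It is only an approximation under the assumption that nova polls once per interval; the exact number of attempts may differ by one depending on the nova version. The function names are hypothetical helpers, not part of juju or nova.

```python
def parse_config_flags(flags):
    """Parse a 'k1=v1,k2=v2' config-flags string into a dict of ints."""
    result = {}
    for item in flags.split(","):
        key, _, value = item.partition("=")
        result[key.strip()] = int(value.strip())
    return result


def approx_wait_seconds(flags):
    """Approximate total polling time in seconds: retries * interval."""
    opts = parse_config_flags(flags)
    return (opts["block_device_allocate_retries"]
            * opts["block_device_allocate_retries_interval"])


if __name__ == "__main__":
    # Same values as in the juju config command above.
    flags = ("block_device_allocate_retries=60,"
             "block_device_allocate_retries_interval=10")
    print(approx_wait_seconds(flags))  # 600
```

With the values above this comes to roughly 600 seconds, that is about 10 minutes of waiting for the block device to become available.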

6, Test a big raw img

wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
sudo qemu-img convert -f qcow2 -O raw bionic-server-cloudimg-amd64.img bionic.raw
openstack image create --public --disk-format raw --file bionic.raw bionic
glance --os-image-url=http://10.134.154.206:9272 image-download $(openstack image show bionic -c id -f value) --file img --progress
md5sum img
md5sum ./bionic.raw

7, Download from nova-compute with the rbd client. This was already tested directly on ceph-monitor, but running it here also rules out problems with the rbd client and between nova-compute and Ceph

rbd --name client.nova-compute --keyring /etc/ceph/ceph.client.nova-compute.keyring -p glance ls 
rbd --name client.nova-compute --keyring /etc/ceph/ceph.client.nova-compute.keyring export --pool glance 123b390f-ca8e-4d68-916f-509990996382 img
md5sum img
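
The md5sum comparisons in the steps above can also be scripted, which helps when the same check is repeated against several endpoints. This is a minimal Python 3 sketch; the helper names are hypothetical, and the file names on the command line are whatever was downloaded or exported.

```python
import hashlib
import sys


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_md5.py img bionic.raw
    first, second = sys.argv[1], sys.argv[2]
    print(file_md5(first), first)
    print(file_md5(second), second)
    print("match" if same_content(first, second) else "MISMATCH")
```

Reading in chunks keeps memory use flat even for multi-gigabyte raw images.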

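The following were then suspected.

Either use raw with CoW, so that the image in Ceph does not need to be downloaded (but this only applies to a rootfs booted from a Ceph image and does not help for a second disk like this one)
Or use juju config glance restrict-image-location-operations=true , see:
https://bugs.launchpad.net/charm-glance/+bug/1786144
Switch to a non-admin user to test and rule out the second point.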

openstack project create myproject --domain default
openstack user create --project-domain default --project myproject --domain default --password password myuser
openstack role add --user myuser --user-domain default --project myproject --project-domain default Member
openstack role assignment list --project myproject --names
export OS_REGION_NAME=RegionOne
export OS_AUTH_URL=https://10.5.0.210:5000/v3
export OS_PROJECT_DOMAIN_NAME=default
export OS_AUTH_PROTOCOL=https
export OS_USERNAME=myuser
export OS_AUTH_TYPE=password
export OS_USER_DOMAIN_NAME=default
export OS_PROJECT_NAME=myproject
export OS_PASSWORD=password
export OS_IDENTITY_API_VERSION=3
openstack token issue

Testing with code:

cat << EOF | sudo tee test.py
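#!/usr/bin/env python
#coding=utf-8
import rados
import rbd
# Connect to the cluster as client.glance and open the glance pool
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='glance')
cluster.connect()
ioctx = cluster.open_ioctx('glance')
image = rbd.Image(ioctx, '123b390f-ca8e-4d68-916f-509990996382')
size = image.size()
bytes_left = size
f = open('tmp', 'wb')
chunk_size = 8388608  # 8 MiB per read
# Read the image sequentially and write each chunk to the local file
while bytes_left > 0:
    length = min(chunk_size, bytes_left)
    data = image.read(size - bytes_left, length)
    print(bytes_left)
    bytes_left -= len(data)
    f.write(data)
f.flush()
f.close()
# Release the image, the pool context and the cluster connection
image.close()
ioctx.close()
cluster.shutdown()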
EOF
time python test.py
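
One detail of the read loop in test.py is worth guarding against: if a read ever returns no data, bytes_left never shrinks and the loop spins forever. The sketch below shows the same chunked copy as a function that raises instead. FakeImage is a hypothetical in-memory stand-in so the logic can be tried without a Ceph cluster; the function only assumes an object with size() and read(offset, length), which is the part of rbd.Image that test.py uses.

```python
import io


def copy_in_chunks(image, out, chunk_size=8 * 1024 * 1024):
    """Copy image to out chunk by chunk and return the bytes written."""
    size = image.size()
    offset = 0
    while offset < size:
        length = min(chunk_size, size - offset)
        data = image.read(offset, length)
        if not data:
            # A zero-length read would otherwise loop forever.
            raise IOError("empty read at offset %d of %d" % (offset, size))
        out.write(data)
        offset += len(data)
    return offset


class FakeImage(object):
    """In-memory stand-in for rbd.Image (hypothetical, for local testing)."""

    def __init__(self, payload):
        self._payload = payload

    def size(self):
        return len(self._payload)

    def read(self, offset, length):
        return self._payload[offset:offset + length]


if __name__ == "__main__":
    src = FakeImage(b"x" * 20)
    dst = io.BytesIO()
    print(copy_in_chunks(src, dst, chunk_size=8))  # 20
```

Against a real cluster, the rbd.Image object opened in test.py could be passed in place of FakeImage, with a file opened in 'wb' mode as out.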

Or run the following command on a glance unit

time rbd --name client.nova-compute --keyring /etc/ceph/ceph.client.nova-compute.keyring export --pool glance 123b390f-ca8e-4d68-916f-509990996382 img;

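In the end, glance/0 and glance/1 turned out to be unable to run the command above, while glance/3 could. An MTU problem was ruled out.
It also turned out that glance/0 could not run the following command while glance/3 could. 10.10.0.41 is ceph-osd/13 and 6802 is the osd port.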
juju run -a glance -- telnet 10.10.0.41 6802
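
Where telnet is not installed on a unit, the same reachability check can be approximated with a few lines of Python 3. This is a sketch with a hypothetical helper name. It only tells whether the TCP handshake completes within the timeout, so a True result does not prove that later traffic gets through.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    # 10.10.0.41:6802 is the ceph-osd address and port from the text.
    print(can_connect("10.10.0.41", 6802))
```

Running it on each glance unit gives a quick per-unit answer that can be compared with the telnet result.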
A tcpdump capture showed that the packets coming from ceph-osd appeared to have been reset (size 0)
Checked iptables on the glance units and the ceph-osd unit and found no problem. It looks like the problem is in the firewall on a hardware router