Troubleshooting a storage I/O problem

Posted: August 20, 2023
Category: ceph

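Author: Rock
Link: https://www.zhihu.com/question/537962218/answer/3144192609
Source: Zhihu
Copyright belongs to the author. For commercial reproduction, contact the author for authorization; for non-commercial reproduction, credit the source.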

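1 Background

We run a small minio cluster, and a large number of pods (200-300) read from and write to it.

One day, developers started reporting that file uploads occasionally failed (roughly once every few days). I checked the monitoring charts: iowait and load were fairly high, but no higher than before, and nothing else looked abnormal, so the matter was dropped.

Over the following days, however, the upload failures became more and more frequent, and it was clear that something really was wrong. What follows is the full troubleshooting and resolution process.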

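2 Troubleshooting

Approach: since the failures were confirmed to occur in user operations against the minio storage cluster, I went straight to checking the minio nodes (although a problem in the k8s cluster could also have been the cause).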

2.1 CPU usage

According to top and the CPU usage chart in monitoring, CPU usage on the minio nodes shot straight up at around 1:30 a.m.

Memory usage was normal.

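2.2 Load average (uptime)

According to uptime and the load chart in monitoring, one node (z-cn-js-sc-000014) had an extremely high load, reaching 1231%, while the other nodes were normal. Each node has 12 CPU cores.

At this point I could not yet tell why the load was so high, so I decided to look at disk I/O first.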

2.3 I/O checks

Checking with iostat

# iostat -x 1 10

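The iostat output showed very high iowait (93.26%), yet the write throughput summed over the wkB/s column (21+12+12+32+12+21+12+49 = 171 MB/s) was not high. The reason the write throughput is this low is explained further below.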
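The summation above can be automated. The following is a minimal sketch, not the author's tooling: it assumes the text of one `iostat -x` report is available as a string and locates the wkB/s column by its header name, since column order differs between sysstat versions. The function name, the device names sda to sdh, and the simplified column set in the sample are hypothetical; only the eight per-disk write figures come from the text.

```python
def total_write_mb_per_s(iostat_text: str) -> float:
    """Sum the wkB/s column over the device rows of `iostat -x` output."""
    header = None
    total_kb = 0.0
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields  # remember column names for the rows that follow
            continue
        if header and "wkB/s" in header and len(fields) == len(header):
            total_kb += float(fields[header.index("wkB/s")])
    return total_kb / 1024  # treating iostat's kB as 1024 bytes


# Per-disk write figures (MB/s) quoted in the text; everything else is made up.
write_mb = [21, 12, 12, 32, 12, 21, 12, 49]
sample = "Device r/s w/s rkB/s wkB/s %util\n" + "\n".join(
    f"sd{chr(97 + i)} 0.0 0.0 0.0 {mb * 1024:.1f} 99.0"
    for i, mb in enumerate(write_mb)
)
print(total_write_mb_per_s(sample))
```

If the input contains several reports (for example from `iostat -x 1 10`), this function adds them all together, so pass it a single report at a time.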

Checking with iotop

# iotop

The iotop output showed IO at 99% across the board, all of it from minio processes.

The iowait chart from monitoring:

The iowait on node z-cn-js-sc-000014 was around 90%.

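Finally, let's look at the netstat output.

minio listens on port 8888 here, and its connections are TCP (they appear as tcp6 in the output below), so the netstat output is filtered by port:
# netstat -nap | grep 8888 # -n prints numeric addresses and ports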
# netstat -lantup | grep 8888 | grep TIME_WAIT
Proto Recv-Q Send-Q    Local Address        Foreign Address         State       PID/Program name
tcp6       0      0 172.18.5.53:8888        172.18.5.45:42924       TIME_WAIT   -
tcp6       0      0 172.18.5.53:8888        172.18.5.11:39058       TIME_WAIT   -
tcp6       0      0 172.18.5.53:8888        172.18.5.2:46062        TIME_WAIT   -
tcp6       0      0 172.18.5.53:8888        172.18.5.48:33954       TIME_WAIT   -
tcp6       0      0 172.18.5.53:8888        172.18.5.2:43330        TIME_WAIT   -
tcp6       0      0 172.18.5.53:8888        172.18.5.2:41368        TIME_WAIT   -
Looking at TIME_WAIT first: Recv-Q and Send-Q are 0 on all of these connections, so nothing is queued on them.
# netstat -lantup | grep 8888 | grep ESTABLISHED
Proto Recv-Q Send-Q  Local Address           Foreign Address         State       PID/Program name
tcp6       0 1364096 172.18.5.53:8888        172.18.5.55:49890       ESTABLISHED 22348/minio
tcp6       0 1435136 172.18.5.53:8888        172.18.5.54:44818       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.55:51218       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.26:33408       ESTABLISHED 22348/minio
tcp6       0 781920  172.18.5.53:8888        172.18.5.54:47780       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.26:33458       ESTABLISHED 22348/minio
tcp6       0 1159848 172.18.5.53:8888        172.18.5.54:47662       ESTABLISHED 22348/minio
tcp6       0 1018872 172.18.5.53:8888        172.18.5.54:47154       ESTABLISHED 22348/minio
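Looking at ESTABLISHED next: the numbers in the third column, Send-Q, are very large, which means data is queuing up.
Explanation:
Recv-Q: the number of bytes held by the OS that have not yet been delivered to the local application.
Send-Q: the number of bytes the application has handed to the OS for sending that the remote host has not yet acknowledged. The OS still has to hold this data, so it remains in the local send buffer.
If Send-Q does not drain quickly, either the application is sending data too fast or the peer is not receiving it fast enough.
Both values (Send-Q and Recv-Q) should normally be 0; if they stay non-zero, something may be wrong.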
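The Send-Q check described above can be scripted. The following is a minimal sketch, assuming the netstat output has been captured as text in the column layout shown above; the function name and the `threshold` parameter are hypothetical. It lists the ESTABLISHED connections on a given local port whose Send-Q exceeds the threshold, largest backlog first. The sample rows are copied from the output in this section.

```python
def queued_connections(netstat_text, port, threshold=0):
    """Return (peer, send_q) pairs for ESTABLISHED sockets with Send-Q > threshold."""
    rows = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a TCP socket row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, _recv_q, send_q, local, remote, state = fields[:6]
        if state != "ESTABLISHED" or not local.endswith(f":{port}"):
            continue
        if int(send_q) > threshold:
            rows.append((remote, int(send_q)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = """\
Proto Recv-Q Send-Q  Local Address           Foreign Address         State       PID/Program name
tcp6       0 1364096 172.18.5.53:8888        172.18.5.55:49890       ESTABLISHED 22348/minio
tcp6       0 1435136 172.18.5.53:8888        172.18.5.54:44818       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.55:51218       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.26:33408       ESTABLISHED 22348/minio
tcp6       0 781920  172.18.5.53:8888        172.18.5.54:47780       ESTABLISHED 22348/minio
tcp6       0      0  172.18.5.53:8888        172.18.5.26:33458       ESTABLISHED 22348/minio
tcp6       0 1159848 172.18.5.53:8888        172.18.5.54:47662       ESTABLISHED 22348/minio
tcp6       0 1018872 172.18.5.53:8888        172.18.5.54:47154       ESTABLISHED 22348/minio
"""

for peer, send_q in queued_connections(sample, 8888):
    print(f"{peer:<22} Send-Q={send_q}")
```

On this sample, five of the eight connections have a backlog, and four of those five go to 172.18.5.54, with the remaining one going to 172.18.5.55.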

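Summary: the results above show that the disks' iowait is very high. The drives in these servers are all SSDs, which in theory should sustain 400-500 MB/s of writes, yet the actual write throughput is only about 170 MB/s. The disks are hitting high I/O wait well below their theoretical throughput, so the likely cause is the performance of the disks themselves.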

2.4 Disk read/write performance test (warning: pointing this command directly at a disk device destroys all data on it)

# fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=16k -size=200G -numjobs=30 -runtime=1000 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.19
Starting 30 threads
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
mytest: Laying out IO file (1 file / 204800MiB)
Jobs: 30 (f=30): [w(30)][0.5%][w=144MiB/s][w=9209 IOPS][eta 16m:35s]

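The write speed was only 144 MiB/s, which confirms that the disks are the bottleneck.
These disks are Samsung SSD 870 QVO drives bought recently on JD.com at about 3,000 yuan each, and we probably ended up with counterfeits.
A 4 TB SSD at 3,000 yuan should manage 400-500 MB/s for reads and writes, and better drives reach 800-900 MB/s (we later bought Samsung 1643A drives at over 5,000 yuan each, and they write at 800-900 MB/s).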
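As a sanity check on the fio status line, the reported bandwidth should equal IOPS multiplied by the block size. The following is a small worked example using the two figures from that line (9209 IOPS, 16 KiB blocks); the function name is hypothetical.

```python
def throughput_mib_per_s(iops: float, block_size_kib: float) -> float:
    """Convert an IOPS figure at a fixed block size into MiB/s."""
    return iops * block_size_kib / 1024


# Figures from the fio status line: w=9209 IOPS with bs=16k.
print(f"{throughput_mib_per_s(9209, 16):.1f} MiB/s")  # prints "143.9 MiB/s"
```

This matches the 144MiB/s that fio reports, so the two numbers describe the same measurement: a 16 KiB random-write workload across 30 threads.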