Ceph Series, Part 3: Calculating Available Space

Posted on: June 27, 2023
Category: ceph

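In day-to-day use of ceph, we often run ceph -s to check the cluster's status and basic capacity, and we can also run ceph df for a more precise view of capacity. So what is the difference between the two? As the cluster stores more files, why do the available capacities they report diverge, and which one should we trust?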

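1. ceph df

Getting the ceph pool information

Because files are stored in the data pool by default, we look up that pool's information. The output shows that the pool keeps only 2 replicas (size 2). This is a test setup; in production, 3 replicas are recommended for higher reliability.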

[root@test-01 ~]# ceph osd dump | grep pool | grep buckets.data

pool 3 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_ruleset 5 object_hash rjenkins pg_num 256 pgp_num 256 last_change 194 flags hashpspool stripe_width 0

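Getting the cluster capacity

The output has two top-level sections, GLOBAL and POOLS. As the names suggest, GLOBAL holds cluster-wide figures: SIZE (total capacity), AVAIL (total available capacity), RAW USED (raw capacity used), and %RAW USED (percentage of raw capacity used). POOLS shows per-pool usage: USED (capacity used), %USED (percentage used), MAX AVAIL (maximum capacity that can still be written), and OBJECTS (number of objects).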

[root@test-01 ~]# ceph df

GLOBAL:

SIZE AVAIL RAW USED %RAW USED

299G 273G 27337M 8.90

POOLS:

NAME ID USED %USED MAX AVAIL OBJECTS

.rgw.root 11 1588 0 122G 4

default.rgw.control 12 0 0 122G 8

default.rgw.data.root 13 77090 0 122G 222

default.rgw.gc 14 0 0 122G 32

default.rgw.log 15 0 0 122G 127

default.rgw.intent-log 16 0 0 122G 0

default.rgw.usage 17 0 0 122G 24

default.rgw.users.keys 18 3602 0 122G 122

default.rgw.users.email 19 0 0 122G 0

default.rgw.users.swift 20 0 0 122G 0

default.rgw.users.uid 21 49345 0 122G 209

default.rgw.buckets.index 22 0 0 122G 111

default.rgw.buckets.data 23 206G 62.78 122G 4643

default.rgw.meta 24 0 0 122G 0

default.rgw.buckets.non-ec 25 0 0 122G 358

rbd-01 26 0 0 83743M 0

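Looking at the output above, a few things stand out. For example: why are AVAIL and MAX AVAIL not equal? And why do nearly all pools show the same MAX AVAIL?

Let's answer the second question first: why is MAX AVAIL the same across pools, even though the sum of those values far exceeds AVAIL? Because all pools in ceph share the same underlying free space, pools that use the same CRUSH rule and replica count report the same value. MAX AVAIL multiplied by the replica count is the raw disk space that would ultimately be consumed, so when the cluster holds little data, MAX AVAIL * replica count ≈ AVAIL.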

AVAIL

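The documentation and source code make it clear that the GLOBAL figures come from the statistics of the underlying filesystem. For example, ceph's FileStore ultimately calls the ::statfs() system call to obtain them, where basedir.c_str() is the OSD data directory. So RAW USED is the sum of the disk usage of all OSD data directories, and likewise AVAIL is the sum of their free disk space.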

int FileStore::statfs(struct statfs *buf)
{
  if (::statfs(basedir.c_str(), buf) < 0) {
    int r = -errno;
    assert(!m_filestore_fail_eio || r != -EIO);
    assert(r != -ENOENT);
    return r;
  }
  return 0;
}
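To make the idea concrete, here is a minimal sketch (not Ceph code) that reads the same kind of numbers for a set of directories and adds them up, the way the GLOBAL figures are sums over all OSD data directories. It uses the POSIX statvfs() call rather than the statfs() call shown above, which is a substitution made for this example. The names FsUsage, query_fs and sum_usage are hypothetical, and the directories passed in stand in for real OSD data directories.

```cpp
#include <sys/statvfs.h>

#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch: capacity of one filesystem, in bytes.
struct FsUsage {
  uint64_t size = 0;   // total capacity
  uint64_t avail = 0;  // space available to unprivileged users
};

// Query the filesystem that holds `dir` with statvfs().
// Returns false if the call fails (for example, the directory does not exist).
bool query_fs(const std::string& dir, FsUsage* out) {
  assert(out != nullptr);
  struct statvfs buf;
  if (::statvfs(dir.c_str(), &buf) != 0) {
    return false;
  }
  out->size = static_cast<uint64_t>(buf.f_blocks) * buf.f_frsize;
  out->avail = static_cast<uint64_t>(buf.f_bavail) * buf.f_frsize;
  return true;
}

// Sum size and avail over several data directories (one per OSD).
// Directories that cannot be queried are skipped.
FsUsage sum_usage(const std::vector<std::string>& dirs) {
  FsUsage total;
  for (const std::string& d : dirs) {
    FsUsage one;
    if (query_fs(d, &one)) {
      total.size += one.size;
      total.avail += one.avail;
    }
  }
  return total;
}
```

Calling sum_usage with one path per OSD data directory would give cluster-wide totals comparable to SIZE and AVAIL, assuming each directory sits on its own disk.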

MAX AVAIL

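The calculation of MAX AVAIL is more involved. Reading the source, the key is the pmap value filled in by get_rule_weight_osd_map, which is simply a map of osd_id to weight/sum, that is, each OSD's share of the total weight under the rule. Readers who are interested can study the rest of the code in more depth.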

int CrushWrapper::get_rule_weight_osd_map(unsigned ruleno, map<int,float> *pmap)
{
  if (ruleno >= crush->max_rules)
    return -ENOENT;
  if (crush->rules[ruleno] == NULL)
    return -ENOENT;
  crush_rule *rule = crush->rules[ruleno];

  // build a weight map for each TAKE in the rule, and then merge them
  for (unsigned i=0; i<rule->len; ++i) {
    map<int,float> m;
    float sum = 0;
    if (rule->steps[i].op == CRUSH_RULE_TAKE) {
      int n = rule->steps[i].arg1;
      if (n >= 0) {
        m[n] = 1.0;
        sum = 1.0;
      } else {
        list<int> q;
        q.push_back(n);
        //breadth first iterate the OSD tree
        while (!q.empty()) {
          int bno = q.front();
          q.pop_front();
          crush_bucket *b = crush->buckets[-1-bno];
          assert(b);
          for (unsigned j=0; j<b->size; ++j) {
            int item_id = b->items[j];
            if (item_id >= 0) { //it's an OSD
              float w = crush_get_bucket_item_weight(b, j);
              m[item_id] = w;
              sum += w;
            } else { //not an OSD, expand the child later
              q.push_back(item_id);
            }
          }
        }
      }
    }
    for (map<int,float>::iterator p = m.begin(); p != m.end(); ++p) {
      map<int,float>::iterator q = pmap->find(p->first);
      if (q == pmap->end()) {
        (*pmap)[p->first] = p->second / sum;
      } else {
        q->second += p->second / sum;
      }
    }
  }
  return 0;
}
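The breadth-first walk is easier to follow on a toy model. The sketch below is not Ceph code: ToyBucket, weight_shares and example_tree are hypothetical names, and the three-host layout is an assumption made for illustration, since the article does not show the cluster's CRUSH tree. It only mirrors the idea that ids >= 0 are OSDs, ids < 0 are child buckets, and the result is osd_id -> weight / sum.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy model of a CRUSH bucket: each item is (id, weight).
// As in CRUSH, ids >= 0 are OSDs and ids < 0 are child buckets.
struct ToyBucket {
  std::vector<std::pair<int, float>> items;
};

// Breadth-first walk from bucket `root`, returning osd_id -> weight / sum,
// i.e. the share of the data each OSD under `root` is expected to hold.
std::map<int, float> weight_shares(const std::map<int, ToyBucket>& buckets,
                                   int root) {
  std::map<int, float> raw;
  float sum = 0;
  std::deque<int> q;
  q.push_back(root);
  while (!q.empty()) {
    int bno = q.front();
    q.pop_front();
    auto it = buckets.find(bno);
    assert(it != buckets.end());
    for (const auto& item : it->second.items) {
      if (item.first >= 0) {  // an OSD: record its weight
        raw[item.first] = item.second;
        sum += item.second;
      } else {  // a child bucket: expand it later
        q.push_back(item.first);
      }
    }
  }
  std::map<int, float> shares;
  for (const auto& p : raw) {
    shares[p.first] = p.second / sum;
  }
  return shares;
}

// Assumed hierarchy: root (-1) -> three hosts (-2, -3, -4) -> three OSDs each,
// every OSD with weight 0.14999 like the test cluster in this article.
std::map<int, ToyBucket> example_tree() {
  std::map<int, ToyBucket> b;
  b[-1].items = {{-2, 0.44997f}, {-3, 0.44997f}, {-4, 0.44997f}};
  b[-2].items = {{0, 0.14999f}, {1, 0.14999f}, {2, 0.14999f}};
  b[-3].items = {{3, 0.14999f}, {4, 0.14999f}, {5, 0.14999f}};
  b[-4].items = {{6, 0.14999f}, {7, 0.14999f}, {8, 0.14999f}};
  return b;
}
```

With nine equally weighted OSDs, weight_shares(example_tree(), -1) gives each OSD a share of about 1/9, which is the factor used in the MAX AVAIL calculation.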

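From the code we can see that get_rule_avail calls osdmap.crush->get_rule_weight_osd_map to compute the available space. Now look at get_rule_avail itself. For each OSD, proj is the OSD's available disk space minus the space reserved by mon_osd_full_ratio (kb * (1 - mon_osd_full_ratio)), divided by that OSD's entry in wm (the pmap from get_rule_weight_osd_map). The smallest proj is assigned to min and returned. This return value is raw capacity; dividing it by the pool's replica count, as in the worked example below, gives MAX AVAIL.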

int64_t PGMonitor::get_rule_avail(OSDMap& osdmap, int ruleno) const
{
  map<int,float> wm;
  int r = osdmap.crush->get_rule_weight_osd_map(ruleno, &wm);
  if (r < 0) {
    return r;
  }
  if (wm.empty()) {
    return 0;
  }

  int64_t min = -1;
  for (map<int,float>::iterator p = wm.begin(); p != wm.end(); ++p) {
    ceph::unordered_map<int32_t,osd_stat_t>::const_iterator osd_info =
      pg_map.osd_stat.find(p->first);
    if (osd_info != pg_map.osd_stat.end()) {
      if (osd_info->second.kb == 0 || p->second == 0) {
        // osd must be out, hence its stats have been zeroed
        // (unless we somehow managed to have a disk with size 0...)
        // (p->second == 0), if osd weight is 0, no need to
        // calculate proj below.
        continue;
      }
      double unusable = (double)osd_info->second.kb *
        (1.0 - g_conf->mon_osd_full_ratio);
      double avail = MAX(0.0, (double)osd_info->second.kb_avail - unusable);
      avail *= 1024.0;
      int64_t proj = (int64_t)(avail / (double)p->second);
      if (min < 0 || proj < min) {
        min = proj;
      }
    } else {
      dout(0) << "Cannot get stat of OSD " << p->first << dendl;
    }
  }
  return min;
}

Worked example

[root@test-01 ~]# ceph osd df

ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS

2 0.14999 1.00000 33774M 4172M 29602M 12.35 1.39 172

0 0.14999 1.00000 33774M 2594M 31180M 7.68 0.86 179

1 0.14999 1.00000 34799M 4218M 30580M 12.12 1.36 206

3 0.14999 1.00000 33774M 2413M 31360M 7.15 0.80 177

4 0.14999 1.00000 34799M 2764M 32034M 7.94 0.89 177

5 0.14999 1.00000 33774M 3544M 30230M 10.49 1.18 196

6 0.14999 1.00000 33774M 3055M 30719M 9.05 1.02 190

7 0.14999 1.00000 34799M 2430M 32368M 6.98 0.78 180

8 0.14999 1.00000 33774M 2148M 31626M 6.36 0.71 171

TOTAL 299G 27341M 273G 8.90

MIN/MAX VAR: 0.71/1.39 STDDEV: 2.12

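(29602 - 33774 * 0.05) / (0.14999 / (0.14999 * 9)) / 1024 / 2 ≈ 122G

Here OSD 2 is the one with the least available space (29602M out of 33774M), 0.05 is 1 - mon_osd_full_ratio, 0.14999 / (0.14999 * 9) is that OSD's weight share among the 9 OSDs, 1024 converts M to G, and 2 is the replica count.

The result matches the value reported by ceph df, so we now understand how MAX AVAIL is calculated. This also answers the first question: AVAIL and MAX AVAIL differ because they are computed in entirely different ways.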
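The same arithmetic can be checked with a short program. This is a simplified sketch of the logic in get_rule_avail, not the Ceph implementation: OsdRow, projected_raw_mb, max_avail_gb and article_osds are hypothetical names, the values are copied from the ceph osd df output above in MB, every OSD is given a share of 1/9 because all weights are equal, and a full ratio of 0.95 is assumed because the calculation above uses 0.05.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record for this sketch: one row of `ceph osd df`, in MB.
struct OsdRow {
  int id;
  double size_mb;
  double avail_mb;
  double share;  // weight / sum of weights under the rule
};

// Raw capacity (MB) projected from the most constrained OSD:
// min over OSDs of max(0, avail - size * (1 - full_ratio)) / share.
// Returns -1 if no OSD contributes.
double projected_raw_mb(const std::vector<OsdRow>& osds, double full_ratio) {
  double min = -1;
  for (const OsdRow& o : osds) {
    if (o.size_mb == 0 || o.share == 0) continue;  // OSD is out or has no weight
    double unusable = o.size_mb * (1.0 - full_ratio);
    double avail = o.avail_mb - unusable;
    if (avail < 0) avail = 0;
    double proj = avail / o.share;
    if (min < 0 || proj < min) min = proj;
  }
  return min;
}

// MAX AVAIL in GB for a replicated pool with `replicas` copies.
double max_avail_gb(const std::vector<OsdRow>& osds, double full_ratio,
                    int replicas) {
  return projected_raw_mb(osds, full_ratio) / 1024.0 / replicas;
}

// The nine OSDs from the `ceph osd df` output above; equal weights, so 1/9 each.
std::vector<OsdRow> article_osds() {
  const double s = 1.0 / 9.0;
  return {
      {2, 33774, 29602, s}, {0, 33774, 31180, s}, {1, 34799, 30580, s},
      {3, 33774, 31360, s}, {4, 34799, 32034, s}, {5, 33774, 30230, s},
      {6, 33774, 30719, s}, {7, 34799, 32368, s}, {8, 33774, 31626, s},
  };
}
```

Evaluating max_avail_gb(article_osds(), 0.95, 2) gives a value between 122 and 123, in line with the 122G that ceph df reports for the size 2 pools.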

2. ceph -s

The capacity values shown by ceph -s are the same as those in the GLOBAL section of ceph df, so the analysis above applies to them as well.

[root@test-01 ~]# ceph -s

cluster xxx

health HEALTH_OK

....

pgmap v5141581: 760 pgs, 16 pools, 206 GB data, 5860 objects

27338 MB used, 273 GB / 299 GB avail

760 active+clean
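As a cross-check, the per-OSD figures from the ceph osd df output in the worked example can be added up to see whether they reproduce these cluster-wide numbers. The sketch below is illustrative only: OsdUsage, ClusterTotals, sum_cluster and article_osd_usage are hypothetical names, and because ceph osd df prints values rounded to whole M, the totals can only be expected to match to the displayed precision.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: SIZE, USE and AVAIL of one OSD in MB, as printed by `ceph osd df`.
struct OsdUsage {
  double size_mb;
  double used_mb;
  double avail_mb;
};

// Cluster-wide totals in the units used by `ceph df` and `ceph -s`.
struct ClusterTotals {
  double size_gb;
  double used_mb;
  double avail_gb;
  double used_percent;
};

// Add up the per-OSD values and derive the percentage of raw space used.
ClusterTotals sum_cluster(const std::vector<OsdUsage>& osds) {
  double size = 0, used = 0, avail = 0;
  for (const OsdUsage& o : osds) {
    size += o.size_mb;
    used += o.used_mb;
    avail += o.avail_mb;
  }
  ClusterTotals t;
  t.size_gb = size / 1024.0;
  t.used_mb = used;
  t.avail_gb = avail / 1024.0;
  t.used_percent = size > 0 ? 100.0 * used / size : 0.0;
  return t;
}

// The nine rows of the `ceph osd df` output shown earlier (SIZE, USE, AVAIL).
std::vector<OsdUsage> article_osd_usage() {
  return {
      {33774, 4172, 29602}, {33774, 2594, 31180}, {34799, 4218, 30580},
      {33774, 2413, 31360}, {34799, 2764, 32034}, {33774, 3544, 30230},
      {33774, 3055, 30719}, {34799, 2430, 32368}, {33774, 2148, 31626},
  };
}
```

The sums come to a little under 300 GB in size, 27338 MB used and a little over 273 GB available, about 8.90% used, which agrees with both ceph -s and the GLOBAL section of ceph df.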

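3. Summary

As the data stored in a ceph cluster grows, the available space keeps shrinking. The MAX AVAIL value in ceph df is the more accurate figure, so we should use ceph df to check available space. MAX AVAIL is also determined by the OSD with the least available space: in theory, if data were perfectly balanced, MAX AVAIL * replica count ≈ AVAIL. Conversely, rebalancing data away from the fullest OSDs increases MAX AVAIL.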
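The effect of balancing can be estimated with a small what-if calculation. The sketch below is a simplification for equally weighted OSDs, not Ceph code: max_avail_equal_weights_gb, balanced_avail, article_sizes and article_avails are hypothetical names, the figures are the SIZE and AVAIL columns of the ceph osd df output in MB, and the "balanced" case simply assumes the same total free space spread so that every OSD has the same fraction free.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// MAX AVAIL in GB for equally weighted OSDs: the fullest OSD sets the limit.
// size_mb and avail_mb hold per-OSD values in MB and must have the same length.
double max_avail_equal_weights_gb(const std::vector<double>& size_mb,
                                  const std::vector<double>& avail_mb,
                                  double full_ratio, int replicas) {
  assert(size_mb.size() == avail_mb.size() && !size_mb.empty());
  const double n = static_cast<double>(size_mb.size());
  double min = -1;
  for (std::size_t i = 0; i < size_mb.size(); ++i) {
    double usable =
        std::max(0.0, avail_mb[i] - size_mb[i] * (1.0 - full_ratio));
    double proj = usable * n;  // same as dividing by a share of 1/n
    if (min < 0 || proj < min) min = proj;
  }
  return min / 1024.0 / replicas;
}

// Hypothetical balanced layout: same total free space, but every OSD
// has the same fraction of its disk free.
std::vector<double> balanced_avail(const std::vector<double>& size_mb,
                                   const std::vector<double>& avail_mb) {
  double size = 0, avail = 0;
  for (std::size_t i = 0; i < size_mb.size(); ++i) {
    size += size_mb[i];
    avail += avail_mb[i];
  }
  std::vector<double> out;
  for (double s : size_mb) out.push_back(s * avail / size);
  return out;
}

// SIZE and AVAIL columns from the `ceph osd df` output, in the same row order.
std::vector<double> article_sizes() {
  return {33774, 33774, 34799, 33774, 34799, 33774, 33774, 34799, 33774};
}
std::vector<double> article_avails() {
  return {29602, 31180, 30580, 31360, 32034, 30230, 30719, 32368, 31626};
}
```

Under these assumptions, the current layout yields a value between 122 and 123 for a 2-replica pool, while the balanced layout yields a higher one, roughly 127 to 128, because no single OSD is fuller than the rest.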

————————————————

Original link: https://blog.csdn.net/effort6/article/details/106553316/

tingyuxinsheng@gmail.com
