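ceph mon: clock skew causes slow requests on a mon, making virtual machine disks unavailable at scale
The Ceph mon cluster implements Paxos to provide consistent access. We recently hit a problem in a 3-mon cluster: the clock on one peon mon drifted, which produced slow requests and in turn led to a serious failure in which OSDs were marked down. A large number of virtual machine workloads raised alarms.
Reproducing the failure
Every 300s an OSD has to report its liveness to the mon by sending a beacon. The beacon sent to the mon stayed stuck, as shown below: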
{
  "description": "osd_beacon(pgs [2.248,2.80,2.288,2.360,2.110,3.60,2.268,2.2e0,2.b8,2.2f1,2.19,2.89,2.91,2.331,2.249,2.161,2.d2,3.5a,2.12a,2.13a,2.21a,2.342,2.23a,2.362,2.243,2.3a3,2.83,2.3eb,3.b,2.22b,2.7b,2.2f3,2.3c,3.3c,2.36c,2.74,3.15,2.2c5,2.38d,2.345,2.335,2.1a5,5.5,2.3a6,2.1e6,3.e,2.3ae,2.386,2.1ce,2.286,2.3be,3.5e,3.5f,2.57,2.267,2.1f,2.39f,2.8f,2.c7,2.237] lec 40508 last_purged_snaps_scrub 2023-03-09T11:23:31.503631+0000 v40509)",
  "initiated_at": "2023-06-05T12:38:00.003962+0000",
  "age": 291.61921003200001,
  "duration": 291.61935419600002,
  "type_data": {
    "events": [
      {
        "time": "2023-06-05T12:38:00.003962+0000",
        "event": "initiated"
      },
      {
        "time": "2023-06-05T12:38:00.003961+0000",
        "event": "throttled"
      },
      {
        "time": "2023-06-05T12:38:00.003962+0000",
        "event": "header_read"
      },
      {
        "time": "2023-06-05T12:38:00.003964+0000",
        "event": "all_read"
      },
      {
        "time": "2023-06-05T12:38:00.004099+0000",
        "event": "dispatched"
      },
      {
        "time": "2023-06-05T12:38:00.004101+0000",
        "event": "mon:_ms_dispatch"
      },
      {
        "time": "2023-06-05T12:38:00.004101+0000",
        "event": "mon:dispatch_op"
      },
      {
        "time": "2023-06-05T12:38:00.004101+0000",
        "event": "psvc:dispatch"
      },
      {
        "time": "2023-06-05T12:38:00.004111+0000",
        "event": "osdmap:wait_for_readable"
      },
      {
        "time": "2023-06-05T12:38:00.004111+0000",
        "event": "osdmap:wait_for_readable/paxos"
      },
      {
        "time": "2023-06-05T12:38:00.004118+0000",
        "event": "paxos:wait_for_readable"
      }
    ],
    "info": {
      "seq": 3208527,
      "src_is_mon": false,
      "source": "osd.19 v2:10.133.17.70:6824/44",
      "forwarded_to_leader": false
    }
The last recorded event, and therefore the point where the op is stuck, is paxos:wait_for_readable.
Code logic
PaxosService::dispatch(MonOpRequestRef op)
...
  if (!is_readable(m->version)) {
    dout(10) << " waiting for paxos -> readable (v" << m->version << ")" << dendl;
    wait_for_readable(op, new C_RetryMessage(this, op), m->version);
    return true;
  }
...
bool Paxos::is_readable(version_t v)
{
...
  ret =
    (mon->is_peon() || mon->is_leader()) &&
    (is_active() || is_updating() || is_writing()) &&
    last_committed > 0 && is_lease_valid(); // must have a value alone, or have lease
  dout(5) << __func__ << " = " << (int)ret
          << " - now=" << ceph_clock_now()
          << " lease_expire=" << lease_expire
          << " has v" << v << " lc " << last_committed
          << dendl;
  return ret;
}
bool Paxos::is_lease_valid()
{
  return ((mon->get_quorum().size() == 1)
          || (ceph::real_clock::now() < lease_expire));
}
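In the is_lease_valid check, the peon compares its own wall-clock time against the lease_expire value that the leader mon synchronized to it. Because the local clock had drifted, the local time was already later than lease_expire, so the skewed mon concluded that Paxos was not readable and put incoming requests on the wait_for_readable queue. That is what ultimately caused the failure.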
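To make the effect of the skew concrete, here is a minimal standalone sketch of the same comparison. It is not Ceph code: PeonLeaseModel, grant_lease and the local_skew field are hypothetical names introduced for illustration, and the model assumes that lease_expire is an absolute timestamp taken from the leader's clock, as the quoted is_lease_valid() suggests.

```cpp
#include <chrono>
#include <cstddef>

using Clock = std::chrono::system_clock;
using TimePoint = Clock::time_point;

// Hypothetical stand-in for the peon-side lease state (not Ceph's Paxos class).
struct PeonLeaseModel {
  std::size_t quorum_size;               // number of mons in quorum
  TimePoint lease_expire;                // absolute expiry, stamped with the leader's clock
  std::chrono::milliseconds local_skew;  // assumed offset: peon clock minus leader clock

  // What the peon's own clock reads when the leader's clock reads leader_now.
  TimePoint local_now(TimePoint leader_now) const {
    return leader_now + local_skew;
  }

  // Same shape as the quoted Paxos::is_lease_valid(): a single-mon quorum is
  // always valid, otherwise the local clock must still be before lease_expire.
  bool is_lease_valid(TimePoint leader_now) const {
    return quorum_size == 1 || local_now(leader_now) < lease_expire;
  }
};

// Leader side of the model: expiry = leader's current time + lease length.
inline TimePoint grant_lease(TimePoint leader_now, std::chrono::milliseconds lease) {
  return leader_now + lease;
}
```

In this model, a peon with no skew sees a freshly granted 5s lease as valid, while a peon whose clock is 6s ahead sees the same lease as already expired at the moment it is granted, so every read would wait.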
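Conclusion
Clock skew on a mon leaves that mon unable to process requests and causes slow requests; when the stuck requests are OSD beacons, the OSDs end up being marked down. By default, each time the leader mon renews the lease it sets lease_expire to its current time + 5s, so a peon mon whose clock runs more than 5s ahead of the leader always sees an expired lease, which triggers the Paxos-unreadable state.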
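The 5s threshold can be turned into a rough back-of-the-envelope estimate of how long a skewed peon stays readable between lease renewals. The sketch below is an assumption-laden model, not Ceph code: readable_seconds_per_cycle is a hypothetical helper, network latency is ignored, and the renewal interval is a free parameter (3s is used in the example on the assumption that the leader renews the lease at 0.6 of its 5s length; check the mon_lease settings of your release before relying on that figure).

```cpp
#include <algorithm>

// Hypothetical helper: seconds per renewal cycle during which a peon still
// sees a valid lease, when its clock runs skew_ahead seconds ahead of the
// leader. All arguments are in seconds.
//   lease          - lease length granted by the leader (5s in this article)
//   renew_interval - assumed time between two lease renewals
//   skew_ahead     - peon clock minus leader clock
double readable_seconds_per_cycle(double lease, double renew_interval, double skew_ahead) {
  // On the peon's clock the lease granted at leader time T expires after
  // (lease - skew_ahead) seconds instead of after lease seconds.
  double window = lease - skew_ahead;
  // The window cannot be negative, and a new lease arrives every renew_interval.
  return std::min(renew_interval, std::max(0.0, window));
}
```

Under these assumptions, with a 5s lease and a 3s renewal interval, a 3.5s skew leaves the peon readable for about 1.5s of every 3s cycle, and any skew of 5s or more leaves it readable for 0s, which matches the permanent wait_for_readable state seen in this incident.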