Ceph使用---dashboard启用及Prometheus监控
一、Ceph dashboard介绍
Ceph dashboard 是通过一个 web 界面, 对已经运行的 ceph 集群进行状态查看及功能配置等功能, 早期 ceph 使用的是第三方的 dashboard 组件。
如:Calamari
Calamari 对外提供了十分漂亮的 Web 管理和监控界面, 以及一套改进的 REST API 接口(不同于 Ceph 自身的 REST API) , 在一定程度上简化了 Ceph 的管理。 最初 Calamari 是作为Inktank 公司的 Ceph 企业级商业产品来销售, 红帽 2015 年收购 Inktank 后为了更好地推动 Ceph 的发展, 对外宣布 Calamari 开源
https://github.com/ceph/calamari
优点:
- 管理功能好
- 界面友好
- 可以利用它来部署 Ceph 和监控 Ceph
缺点:
- 非官方
- 依赖 OpenStack 某些包
VSM
Virtual Storage Manager (VSM)是 Intel 公司研发并且开源的一款 Ceph 集群管理和监控软件, 简化了一些 Ceph 集群部署的一些步骤, 可以简单的通过 WEB 页面来操作。
https://github.com/intel/virtual-storage-manager
优点:
- 易部署
- 轻量级
- 灵活(可以自定义开发功能)
缺点:
- 监控选项少
- 缺乏 Ceph 管理功能
Inkscope
Inkscope 是一个 Ceph 的管理和监控系统, 依赖于 Ceph 提供的 API, 使用 MongoDB来存储实时的监控数据和历史信息。
https://github.com/inkscope/inkscope
优点:
- 易部署
- 轻量级
- 灵活(可以自定义开发功能)
缺点:
- 监控选项少
- 缺乏 Ceph 管理功能
Ceph-Dash
Ceph-Dash 是用 Python 开发的一个 Ceph 的监控面板, 用来监控 Ceph 的运行状态。同时提供 REST API 来访问状态数据。
http://cephdash.crapworks.de/
优点:
- 易部署
- 轻量级
- 灵活(可以自定义开发功能)
缺点:
- 功能相对简单
二、启用dashboard 插件
https://docs.ceph.com/en/mimic/mgr/
https://docs.ceph.com/en/latest/mgr/dashboard/
https://packages.debian.org/unstable/ceph-mgr-dashboard #15 版本有依赖需要单独解决Ceph mgr 是一个多插件(模块化)的组件, 其组件可以单独的启用或关闭。
以下为在ceph-deploy 服务器操作:
新版本需要安装 dashboard, 而且必须安装在 mgr 节点, 否则报错如下:
The following packages have unmet dependencies: ceph-mgr-dashboard : Depends: ceph-mgr (= 15.2.13-1~bpo10+1) but it is not going to be installed E: Unable to correct problems, you have held broken packages.
在所有ceph-mgr节点安装ceph-dashboard插件
[root@ceph-mgr1 ~]# apt-cache madison ceph-mgr-dashboard ceph-mgr-dashboard | 16.2.10-1focal | https://mirrors.tuna.tsinghua.edu.cn/ceph/debian-pacific focal/main amd64 Packages ceph-mgr-dashboard | 15.2.16-0ubuntu0.20.04.1 | https://mirrors.tuna.tsinghua.edu.cn/ubuntu focal-updates/universe amd64 Packages ceph-mgr-dashboard | 15.2.12-0ubuntu0.20.04.1 | https://mirrors.tuna.tsinghua.edu.cn/ubuntu focal-security/universe amd64 Packages ceph-mgr-dashboard | 15.2.1-0ubuntu1 | https://mirrors.tuna.tsinghua.edu.cn/ubuntu focal/universe amd64 Packages [root@ceph-mgr1 ~]# apt install ceph-mgr-dashboard
启用dashboard
cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr module -h #查看帮助
cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr module ls #16.2.10新版本模块太多,大概3200行左右
cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr module enable dashboard
cephadmin@ceph-deploy:~/ceph-cluster$
Error ENOENT: all mgr daemons do not support module 'dashboard', pass --force to force enablement
解决方案:在所有mgr节点安装:ceph mgr module enable dashboard
注:模块启用后还不能直接访问, 需要配置关闭 SSL 或启用 SSL 及指定监听地址。
启用 dashboard 模块
Ceph dashboard 在 mgr 节点进行开启设置, 并且可以配置开启或者关闭 SSL, 如下:
cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl false #关闭SSL cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1/server_addr 172.16.88.111 #指定dashboard监听地址 cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1/server_port 9009 #指定dashboard监听端口 cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr services {} cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr services #查看ceph-mgr dashboard监听端口 { "dashboard": "http://172.16.88.111:9009/" } cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph -s #检查ceph集群健康状态 cluster: id: 8dc32c41-121c-49df-9554-dfb7deb8c975 health: HEALTH_OK services: mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 33h) mgr: ceph-mgr1(active, since 7m), standbys: ceph-mgr2 mds: 2/2 daemons up, 2 standby osd: 9 osds: 9 up (since 33h), 9 in (since 7d) rgw: 2 daemons active (2 hosts, 1 zones) data: volumes: 1/1 healthy pools: 13 pools, 433 pgs objects: 843 objects, 1.8 GiB usage: 6.2 GiB used, 894 GiB / 900 GiB avail pgs: 433 active+clean io: client: 29 KiB/s rd, 0 B/s wr, 28 op/s rd, 19 op/s wr cephadmin@ceph-deploy:~/ceph-cluster$ [root@ceph-mgr1 ~]# netstat -tnlp Active Internet connections (only servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name tcp 0 0 172.16.88.111:6805 0.0.0.0:* LISTEN 22005/ceph-mgr tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 666/systemd-resolve tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1144/sshd: /usr/sbi tcp 0 0 0.0.0.0:40825 0.0.0.0:* LISTEN 790/rpc.statd tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 21235/sshd: root@pt tcp 0 0 127.0.0.1:6011 0.0.0.0:* LISTEN 22259/sshd: root@pt tcp 0 0 0.0.0.0:9443 0.0.0.0:* LISTEN 7627/radosgw tcp 0 0 0.0.0.0:9900 0.0.0.0:* LISTEN 7627/radosgw tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/init tcp 0 0 172.16.88.111:6800 0.0.0.0:* LISTEN 7187/ceph-mds tcp 0 0 172.16.88.111:9009 0.0.0.0:* LISTEN 22005/ceph-mgr tcp 0 0 172.16.88.111:6801 0.0.0.0:* LISTEN 7187/ceph-mds tcp 0 0 127.0.0.1:42225 0.0.0.0:* LISTEN 777/containerd tcp 0 0 172.16.88.111:6802 0.0.0.0:* LISTEN 855/ceph-mds tcp 0 0 172.16.88.111:6803 0.0.0.0:* LISTEN 855/ceph-mds tcp 0 0 172.16.88.111:6804 0.0.0.0:* LISTEN 22005/ceph-mgr tcp6 0 0 :::22 :::* LISTEN 1144/sshd: /usr/sbi tcp6 0 0 ::1:6010 :::* LISTEN 21235/sshd: root@pt tcp6 0 0 ::1:6011 :::* LISTEN 22259/sshd: root@pt tcp6 0 0 :::37979 :::* LISTEN 790/rpc.statd tcp6 0 0 :::2049 :::* LISTEN 887/ganesha.nfsd tcp6 0 0 :::111 :::* LISTEN 1/init tcp6 0 0 :::8080 :::* LISTEN 21994/ceph-mgr [root@ceph-mgr1 ~]#
如果出现"Module 'dashboard' has failed: OSError("Port 8080 not free on '::'")"
cephadmin@ceph-deploy:~/ceph-cluster$ ceph -s cluster: id: 8dc32c41-121c-49df-9554-dfb7deb8c975 health: HEALTH_ERR Module 'dashboard' has failed: OSError("Port 8080 not free on '::'") services: mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 33h) mgr: ceph-mgr1(active, since 9m), standbys: ceph-mgr2 mds: 2/2 daemons up, 2 standby osd: 9 osds: 9 up (since 33h), 9 in (since 7d) rgw: 2 daemons active (2 hosts, 1 zones) data: volumes: 1/1 healthy pools: 13 pools, 433 pgs objects: 843 objects, 1.8 GiB usage: 6.2 GiB used, 894 GiB / 900 GiB avail pgs: 433 active+clean cephadmin@ceph-deploy:~/ceph-cluster$ ceph health detail HEALTH_ERR Module 'dashboard' has failed: OSError("Port 8080 not free on '::'") [ERR] MGR_MODULE_ERROR: Module 'dashboard' has failed: OSError("Port 8080 not free on '::'") Module 'dashboard' has failed: OSError("Port 8080 not free on '::'") cephadmin@ceph-deploy:~/ceph-cluster$
解决办法:
重启mgr服务 [root@ceph-mgr1 ~]# systemctl restart ceph-mgr@ceph-mgr1.service 或者 [root@ceph-mgr1 ~]# systemctl restart ceph-mgr.target 再次设置dashboard监听地址与端口 cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl false cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1/server_addr 172.16.88.111 cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-mgr1/server cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr services { "dashboard": "http://172.16.88.111:9009/" } cephadmin@ceph-deploy:~/ceph-cluster$_port 9009 cephadmin@ceph-deploy:~/ceph-cluster$
设置 dashboard 账户及密码
cephadmin@ceph-deploy:~/ceph-cluster$ touch pass.txt cephadmin@ceph-deploy:~/ceph-cluster$ echo "12345678" > pass.txt cephadmin@ceph-deploy:~/ceph-cluster$ ceph dashboard set-login-credentials admin -i pass.txt ****************************************************************** *** WARNING: this command is deprecated. *** *** Please use the ac-user-* related commands to manage users. *** ****************************************************************** Username and password updated cephadmin@ceph-deploy:~/ceph-cluster$
dashboard 验证
ceph-node节点
ceph-osd详情
集群pool详情
cephfs
radosgw
三、dashboard SSL启用
如果要使用 SSL 访问。 则需要配置签名证书。 证书可以使用 ceph 命令生成, 或是 opessl命令生成。
https://docs.ceph.com/en/latest/mgr/dashboard/
cephadmin@ceph-deploy:~/ceph-cluster$ ceph dashboard create-self-signed-cert #创建证书 Self-signed certificate created cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl true #开启ssl认证 cephadmin@ceph-deploy:~/ceph-cluster$ cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr services { "dashboard": "http://172.16.88.111:9009/" } cephadmin@ceph-deploy:~/ceph-cluster$ 重启ceph-mgr1服务 [root@ceph-mgr1 ~]# systemctl restart ceph-mgr.target cephadmin@ceph-deploy:~/ceph-cluster$ ceph mgr services { "dashboard": "https://172.16.88.111:8443/" } cephadmin@ceph-deploy:~/ceph-cluster$
四、通过 prometheus监控ceph集群
[root@ceph-mgr1 ~]# mkdir /apps [root@ceph-mgr1 ~]# cd /apps/ [root@ceph-mgr1 apps]# ls [root@ceph-mgr1 apps]# wget https://github.com/prometheus/prometheus/releases/download/v2.39.1/prometheus-2.39.1.linux-amd64.tar.gz [root@ceph-mgr1 apps]# tar -xf prometheus-2.39.1.linux-amd64.tar.gz [root@ceph-mgr1 apps]# mv prometheus-2.39.1.linux-amd64 prometheus [root@ceph-mgr1 prometheus]# vi /etc/systemd/system/prometheus.service [root@ceph-mgr1 prometheus]# cat /etc/systemd/system/prometheus.service [Unit] Description=Prometheus Server Documentation=https://prometheus.io/docs/introduction/overview/ After=network.target [Service] Restart=on-failure WorkingDirectory=/apps/prometheus/ ExecStart=/apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml [Install] WantedBy=multi-user.target [root@ceph-mgr1 prometheus]# [root@ceph-mgr1 prometheus]# systemctl daemon-reload [root@ceph-mgr1 prometheus]# systemctl enable --now prometheus.service
部署 node_exporter:
各 node 节点安装 node_exporter
root@easzlab-deploy:~# ansible 'vm' -m shell -a "mkdir /apps" root@easzlab-deploy:~# mkdir ceph-prometheus root@easzlab-deploy:~# cd ceph-prometheus/ root@easzlab-deploy:~/ceph-prometheus# vi /etc/systemd/system/node-exporter.service [Unit] Description=Prometheus Node Exporter After=network.target [Service] ExecStart=/apps/node_exporter/node_exporter [Install] WantedBy=multi-user.target root@easzlab-deploy:~/ceph-prometheus# root@easzlab-deploy:~/ceph-prometheus# wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz root@easzlab-deploy:~/ceph-prometheus# mv node_exporter-1.4.0.linux-amd64 node_exporter root@easzlab-deploy:~# ansible 'vm' -m copy -a "src=/root/ceph-prometheus/node_exporter dest=/apps" root@easzlab-deploy:~# ansible 'vm' -m copy -a "src=/root/ceph-prometheus/node-exporter.service dest=/etc/systemd/system/" root@easzlab-deploy:~# ansible 'vm' -m shell -a "systemctl daemon-reload && systemctl enable --now node-exporter.service"
配置 prometheus server 数据并验证
[root@ceph-mgr1 ~]# vim /apps/prometheus/prometheus.yml [root@ceph-mgr1 ~]# cat /apps/prometheus/prometheus.yml # my global config global: scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute. evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. # scrape_timeout is set to the global default (10s). # Alertmanager configuration alerting: alertmanagers: - static_configs: - targets: # - alertmanager:9093 # Load rules once and periodically evaluate them according to the global 'evaluation_interval'. rule_files: # - "first_rules.yml" # - "second_rules.yml" # A scrape configuration containing exactly one endpoint to scrape: # Here it's Prometheus itself. scrape_configs: # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config. - job_name: "prometheus" # metrics_path defaults to '/metrics' # scheme defaults to 'http'. static_configs: - targets: ["localhost:9090"] - job_name: 'ceph-node-data' static_configs: - targets: ['172.16.88.121:9100','172.16.88.122:9100','172.16.88.123:9100'] [root@ceph-mgr1 ~]# [root@ceph-mgr1 ~]# cd /apps/prometheus/ [root@ceph-mgr1 prometheus]# ./promtool check prometheus.yml promtool: error: expected command but got "prometheus.yml", try --help [root@ceph-mgr1 prometheus]# ./promtool check config prometheus.yml Checking prometheus.yml SUCCESS: prometheus.yml is valid prometheus config file syntax [root@ceph-mgr1 prometheus]# [root@ceph-mgr1 prometheus]# systemctl restart prometheus.service
通过 prometheus 监控 ceph 服务:
Ceph manager 内部的模块中包含了 prometheus 的监控模块,并监听在每个 manager 节点的 9283 端口, 该端口用于将采集到的信息通过 http 接口向 prometheus 提供数据。
https://docs.ceph.com/en/mimic/mgr/prometheus/?highlight=prometheus
启用 prometheus 监控模块
[root@ceph-deploy ~]# ceph mgr module enable prometheus [root@ceph-deploy ~]#
验证 manager 数据
配置 prometheus 采集数据
[root@ceph-mgr1 prometheus]# vi prometheus.yml [root@ceph-mgr1 prometheus]# cat prometheus.yml # my global config global: scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute. evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. # scrape_timeout is set to the global default (10s). # Alertmanager configuration alerting: alertmanagers: - static_configs: - targets: # - alertmanager:9093 # Load rules once and periodically evaluate them according to the global 'evaluation_interval'. rule_files: # - "first_rules.yml" # - "second_rules.yml" # A scrape configuration containing exactly one endpoint to scrape: # Here it's Prometheus itself. scrape_configs: # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config. - job_name: "prometheus" # metrics_path defaults to '/metrics' # scheme defaults to 'http'. static_configs: - targets: ["localhost:9090"] - job_name: 'ceph-node-data' static_configs: - targets: ['172.16.88.121:9100','172.16.88.122:9100','172.16.88.123:9100'] - job_name: 'ceph-cluster-data' static_configs: - targets: ['172.16.88.111:9283'] [root@ceph-mgr1 prometheus]# [root@ceph-mgr1 prometheus]# [root@ceph-mgr1 prometheus]# ./promtool check config prometheus.yml Checking prometheus.yml SUCCESS: prometheus.yml is valid prometheus config file syntax [root@ceph-mgr1 prometheus]# systemctl restart prometheus.service
通过 granfana 显示对 ceph 的集群监控数据及 node 数据
安装略
dpkg -i grafana-enterprise_8.5.6_amd64.deb systemctl enable --now grafana-server
添加Prometheus-ceph数据源
导入模板:
https://grafana.com/grafana/dashboards/5336 #ceph OSD
https://grafana.com/grafana/dashboards/5342 #ceph pools
https://grafana.com/grafana/dashboards/7056 #ceph cluser
https://grafana.com/grafana/dashboards/2842