This document records a verification of whether a cluster can be recovered normally after the IP addresses of all its nodes are changed, ahead of doing the same in the 陕建 production environment. The verification environment is a k8s cluster in 亦庄.
1. Verification Overview
1.1 Objectives
The main goals of this verification:
- Verify that the k8s cluster can be recovered normally after its IP addresses are changed
- Verify that middleware, databases, etc. can be recovered normally after the IP of the underlying storage changes
1.2 Verification Steps
The verification steps are as follows:
- Deploy a k8s cluster with three Master nodes and at least three worker nodes
- Install NFS storage and create a storage class
- Install the middleware and databases used for testing, plus minio (used to verify that data can be recovered after the NFS address changes)
- Import test data into MySQL, Es, minio, etc.
- Change the IP addresses of all nodes
- Restore the Minio object storage
- Restore the other middleware and databases
- Recover the k8s cluster
2. Environment Preparation
Omitted
3. Creating the Storage
Omitted
4. Deploying the Services
The following services were deployed:
- MySQL
- git
- elasticsearch
The deployment details are omitted
5. Importing Test Data
- MySQL: real data from 重庆中科
- git: repositories from the 陕建数科 platform
- elasticsearch: 10000 dummy documents were inserted into the cluster
- minio: several buckets were created and files of different types were written into each
The details are omitted
6. Changing the IP Addresses of All Nodes
This is done in vCenter; the details are omitted
| No. | Original IP | New IP |
|---|---|---|
| 1 | 10.127.91.137 | 10.127.91.170 |
| 2 | 10.127.91.138 | 10.127.91.171 |
| 3 | 10.127.91.139 | 10.127.91.172 |
| 4 | 10.127.91.140 | 10.127.91.173 |
| 5 | 10.127.91.141 | 10.127.91.174 |
| 6 | 10.127.91.142 | 10.127.91.175 |
| 7 | 10.127.91.143 | 10.127.91.176 |
| 8 | 10.127.91.144 | 10.127.91.177 |
| 9 | 10.127.91.145 | 10.127.91.178 |
7. Restoring Persistent Data
Because the storage class is backed by NFS, the dynamically provisioned PVs have the NFS server address written into them, and this field cannot be edited in place. Recovery therefore requires creating additional PVs and PVCs and binding them manually.
For a stateless service, all replicas share one volume even when multiple replicas are configured, so a single new PVC is enough.
For a stateful service with multiple replicas, every replica needs its own dedicated volume, so an independent PV and PVC must be created for each replica, following the StatefulSet naming convention.
Recovery approaches:
- Option 1: from the original PV, find the backing directory and create the new PV and PVC on top of it (see the lookup sketch below)
- Option 2: from the original PV, find the backing directory, rename it to guard against accidental deletion later, and then create the new PV and PVC on top of the renamed directory
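Before creating the replacement objects, the NFS server and backing directory can be read directly from the original PV. A minimal lookup sketch; the PV name below is illustrative and should be replaced with the real name from kubectl get pv:
# print the NFS server and export path recorded in the original PV
kubectl get pv pvc-873734a2-5aeb-4c6a-8bef-d70f1547ba89 \
  -o jsonpath='{.spec.nfs.server}{"\n"}{.spec.nfs.path}{"\n"}'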
7.1 Recovering a stateless service
Using minio as an example, create a dedicated PV and PVC with their own storage class, as shown below:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 50Gi
  nfs:
    path: /data/nfsStorage/public-service-minio-nfs-pvc-pvc-873734a2-5aeb-4c6a-8bef-d70f1547ba89 # look this path up in the PV that previously backed minio
    server: 10.127.91.176 # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: minio-nfs-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: minio-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: minio-nfs-data
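A short sketch of applying and checking the manual binding, assuming the manifest above is saved as minio-pv.yaml (the file name is illustrative) and the PVC is created in the same namespace as the minio Deployment, public-service:
# apply the PV and PVC, then confirm the PVC binds to the new PV
kubectl apply -n public-service -f minio-pv.yaml
kubectl -n public-service get pvc minio-pvc   # STATUS should change to Bound
kubectl get pv minio-pv                       # CLAIM should show public-service/minio-pvc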
Finally, update minio's YAML as follows (the existing Deployment has to be deleted first):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio-nfs
  namespace: public-service
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: minio-nfs
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: minio-nfs
    spec:
      containers:
      - args:
        - gateway
        - nas
        - /shared/nasvol
        - --console-address
        - :9001
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              key: appkey
              name: minio-credentials
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              key: appsecret
              name: minio-credentials
        image: registry.cn-beijing.aliyuncs.com/gldsg-prod/minio:v20220916
        imagePullPolicy: IfNotPresent
        name: minio-nfs
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared/nasvol
          name: storage
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: cloud-reg
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: minio-pvc # change the claim name to the PVC created above
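After the Deployment is re-applied, a quick check that the pod starts on the restored volume might look like the sketch below; whether the old buckets and objects are intact can then be confirmed through the minio console:
# the recreated pod should reach Running with the restored volume mounted
kubectl -n public-service get pods -l app=minio-nfs
kubectl -n public-service logs deploy/minio-nfs --tail=20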
7.2 Recovering a single-replica stateful service
Using the git service as an example:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: git-pv-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-git-data-git-0-pvc-1d7fc16f-f957-471e-8aed-58c117fdf80e # look this path up in the PV that previously backed git
    server: 10.127.91.176 # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: git-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: git-data-new-git-0
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: git-data
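For a StatefulSet the PVC name must follow the pattern <volumeClaimTemplate name>-<statefulset name>-<ordinal>. The name git-data-new-git-0 above therefore assumes the git StatefulSet is re-created with its volume claim template renamed from git-data to git-data-new, roughly as in this sketch (only the relevant fields are shown):
volumeClaimTemplates:
- metadata:
    name: git-data-new # renamed so the StatefulSet picks up the pre-created PVC git-data-new-git-0
  spec:
    accessModes:
    - ReadWriteMany
    storageClassName: git-data
    resources:
      requests:
        storage: 20Gi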
7.3 Recovering a multi-replica stateful service
Using elasticsearch as an example:
# First replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-0-pvc-5a49b9c5-097b-4040-a54e-941ade92b61d # look this path up in the PV that previously backed elasticsearch replica 0
    server: 10.127.91.176 # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-0
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
# Second replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-1
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-1-pvc-9155ce12-a6f4-4d89-9b26-7b6f25959d4d # look this path up in the PV that previously backed elasticsearch replica 1
    server: 10.127.91.176 # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-1
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
# Third replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-2
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-2-pvc-d3aa037f-6e23-4c6b-971f-61c592be2d24 # look this path up in the PV that previously backed elasticsearch replica 2
    server: 10.127.91.176 # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-2
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
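Once the elasticsearch pods are back up on the restored volumes, the previously inserted 10000 test documents can be spot-checked. A sketch, assuming the cluster is exposed as a Service named elasticsearch on port 9200 in the public-service namespace (names are illustrative):
# forward the elasticsearch HTTP port locally, then check cluster health and per-index document counts
kubectl -n public-service port-forward svc/elasticsearch 9200:9200 &
curl -s http://127.0.0.1:9200/_cluster/health?pretty
curl -s http://127.0.0.1:9200/_cat/indices?v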
8. Recovering the K8S Cluster
8.1 Change the IP of the first control node
Pay close attention at this step:
- The nodes must be changed strictly one at a time
- After the change, the new IP must be reachable from the other nodes, otherwise the node cannot be used (a quick reachability check is sketched below)
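A minimal reachability check from the node that was just changed, assuming the first master has moved to 10.127.91.170 while the other two masters are still on their old addresses:
# the remaining masters are still on 10.127.91.138 / 10.127.91.139 at this point
for ip in 10.127.91.138 10.127.91.139; do ping -c 2 "$ip"; done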
8.2 Recover the first control node
The recovery is done hot, node by node; never recover several nodes at the same time
1) Update the static pod manifests of etcd, kube-apiserver, kube-controller-manager and kube-scheduler
# This step can be run on all master nodes at the same time
cd /etc/kubernetes/manifests
sed -i "s/10.127.91.137/10.127.91.170/g" *
sed -i "s/10.127.91.138/10.127.91.171/g" *
sed -i "s/10.127.91.139/10.127.91.172/g" *
2) Update /etc/hosts on all master nodes (can be done in parallel)
sed -i "s/10.127.91.137/10.127.91.170/g" /etc/hosts
sed -i "s/10.127.91.138/10.127.91.171/g" /etc/hosts
sed -i "s/10.127.91.139/10.127.91.172/g" /etc/hosts
sed -i "s/10.127.91.140/10.127.91.173/g" /etc/hosts
sed -i "s/10.127.91.141/10.127.91.174/g" /etc/hosts
sed -i "s/10.127.91.142/10.127.91.175/g" /etc/hosts
sed -i "s/10.127.91.143/10.127.91.176/g" /etc/hosts
sed -i "s/10.127.91.144/10.127.91.177/g" /etc/hosts
sed -i "s/10.127.91.145/10.127.91.178/g" /etc/hosts
3) Update the haproxy configuration on all master nodes
sed -i "s/10.127.91.137/10.127.91.170/g" /etc/haproxy/haproxy.cfg
sed -i "s/10.127.91.138/10.127.91.171/g" /etc/haproxy/haproxy.cfg
sed -i "s/10.127.91.139/10.127.91.172/g" /etc/haproxy/haproxy.cfg
# haproxy must be restarted after the change
systemctl restart haproxy
4) Update the ETCD member data
This step is critical; make sure the node-to-IP mapping is correct before running any command
Reference: https://pytimer.github.io/2019/05/change-etcd-cluster-member-ip/
etcdctl download: https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz
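One way to install the downloaded etcdctl binary (paths are illustrative):
# unpack the release tarball and put etcdctl on the PATH
tar -zxvf etcd-v3.4.3-linux-amd64.tar.gz
cp etcd-v3.4.3-linux-amd64/etcdctl /usr/local/bin/
etcdctl version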
# This command lists the ETCD cluster members and prints each member's unique ID, which the member update below operates on
# Run this step on another master node (any of the other masters)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
# Output
1bbc5fd9dc1687a, started, k8s-master-91-139, https://10.127.91.139:2380, https://10.127.91.139:2379, false
440629d2b8ac46bc, started, k8s-master-91-137, https://10.127.91.137:2380, https://10.127.91.137:2379, false
47f4d2a7f3d37348, started, k8s-master-91-138, https://10.127.91.138:2380, https://10.127.91.138:2379, false
# The output shows that the peer URL of the first master still holds the old IP, so it has to be updated manually with the command below
# 440629d2b8ac46bc is the unique ID of the first control node; the update is performed against this ID
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
member update 440629d2b8ac46bc --peer-urls=https://10.127.91.170:2380
# Output
Member 440629d2b8ac46bc updated in cluster 55c364da67e7bd4f
# List the ETCD members again
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
# Output
1bbc5fd9dc1687a, started, k8s-master-91-139, https://10.127.91.139:2380, https://10.127.91.139:2379, false
440629d2b8ac46bc, started, k8s-master-91-137, https://10.127.91.170:2380, https://10.127.91.137:2379, false
47f4d2a7f3d37348, started, k8s-master-91-138, https://10.127.91.138:2380, https://10.127.91.138:2379, false
# The peer-urls entry has now been updated
One problem remains after the steps above: the etcd certificates have not been changed. The ETCD cluster was not initialised with a stable address, so the node IPs were generated into the certificates. The relevant ETCD certificates therefore need to be deleted and regenerated.
# The information embedded in a certificate can be inspected with openssl
# (point -in at the specific certificate file)
openssl x509 -noout -text -in *.crt
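For example, to see which IPs are embedded in the etcd server certificate under the standard kubeadm layout:
# the SAN list still contains the old node IP until the certificate is regenerated
openssl x509 -noout -text -in /etc/kubernetes/pki/etcd/server.crt | grep -A1 "Subject Alternative Name"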
5) Regenerate the certificates
Regenerate the ETCD certificates:
# Run on the first control node; the etcd certificates live under /etc/kubernetes/pki/etcd
cd /etc/kubernetes/pki/etcd
rm -f server.*
rm -f healthcheck-client.*
rm -f peer.*
# Regenerate the certificates. For convenience all certificates are regenerated here; existing ones are not overwritten.
kubeadm init phase certs all
Regenerate the apiserver certificate:
cd /etc/kubernetes/pki
rm -f apiserver.*
kubeadm init phase certs apiserver \
--apiserver-advertise-address 10.127.91.170 \
--apiserver-cert-extra-sans ha.api.k8s.gm \
--apiserver-cert-extra-sans 172.17.0.10 \
--apiserver-cert-extra-sans 172.17.0.1 \
--apiserver-cert-extra-sans 127.0.0.1 \
--apiserver-cert-extra-sans kubernetes \
--apiserver-cert-extra-sans kubernetes.default \
--apiserver-cert-extra-sans kubernetes.default.svc \
--apiserver-cert-extra-sans kubernetes.default.svc.cluster.local
After the certificates have been replaced, restart kubelet and docker
# In theory restarting kubelet alone is enough; docker is restarted here as well for good measure
systemctl restart docker
systemctl restart kubelet
6) Restart the relevant k8s components
docker ps -a -q --filter name=k8s_kube | xargs -r docker rm --force --volumes
docker ps -a -q --filter name=k8s_etcd | xargs -r docker rm --force --volumes
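After the old containers are removed, kubelet recreates the static pods from the updated manifests; they can be watched coming back, for example:
# the recreated control-plane containers should appear again shortly
docker ps --filter name=k8s_kube-apiserver
docker ps --filter name=k8s_etcd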
7) Update the hosts, haproxy and related configuration on all worker nodes
sed -i 's/10.127.91.137/10.127.91.170/g' /etc/haproxy/haproxy.cfg
sed -i 's/10.127.91.138/10.127.91.171/g' /etc/haproxy/haproxy.cfg
sed -i 's/10.127.91.139/10.127.91.172/g' /etc/haproxy/haproxy.cfg
sed -i 's/10.127.91.137/10.127.91.170/g' /etc/hosts
sed -i 's/10.127.91.138/10.127.91.171/g' /etc/hosts
sed -i 's/10.127.91.139/10.127.91.172/g' /etc/hosts
At this point the first master node is fully recovered
8.3 Change the IP of the second master node
Omitted
8.4 Recover the second master node
Omitted
8.5 Change the IP of the third master node
Omitted
8.6 Recover the third master node
Omitted
8.7 Check the ETCD cluster status
Once the IPs of all master nodes have been changed and the masters have been recovered, check the ETCD cluster status
# ETCDCTL_API=3 etcdctl -w table --endpoints=https://10.127.91.170:2379,https://10.127.91.171:2379,https://10.127.91.172:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.127.91.170:2379 | 440629d2b8ac46bc | 3.4.3 | 7.4 MB | false | false | 779 | 93090 | 93090 | |
| https://10.127.91.171:2379 | 47f4d2a7f3d37348 | 3.4.3 | 7.3 MB | true | false | 779 | 93090 | 93090 | |
| https://10.127.91.172:2379 | 1bbc5fd9dc1687a | 3.4.3 | 7.4 MB | false | false | 779 | 93090 | 93090 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
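With etcd healthy, the overall cluster state can be confirmed from any master, for example:
# all nodes should be Ready and report the new internal IPs
kubectl get nodes -o wide
# anything not Running or Completed deserves a closer look
kubectl get pods -A -o wide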

