Case Study: Recovering a k8s Cluster After Changing All Node IPs

This article validates whether a cluster can be recovered normally after the IPs of all its nodes are changed, ahead of performing the same change in the 陕建 production environment. The validation environment is a k8s cluster in 亦庄 (Yizhuang).

1. Validation Overview

1.1 Validation Goals

The main goals of this validation:

  1. Verify that the k8s cluster can be recovered normally after its node IP addresses change
  2. Verify that the middleware, databases, etc. can be recovered normally after the underlying storage IP changes

1.2 Validation Steps

The validation proceeds as follows:

  1. Deploy a k8s cluster with three master nodes and at least three worker nodes
  2. Install NFS storage and create a storage class
  3. Install middleware and databases for testing, including minio (used to verify that data can be recovered after the NFS address changes)
  4. Import test data into MySQL, Es, minio, etc.
  5. Change the IP addresses of all nodes
  6. Recover the Minio object storage
  7. Recover the other middleware and databases
  8. Recover the k8s cluster

2. Environment Preparation

Omitted.

3. Storage Setup

Omitted.

4. Service Deployment

The following services were deployed:

  • MySQL
  • git
  • elasticsearch

Deployment details omitted.

5. Test Data Import

  • MySQL: real data from 重庆中科
  • git: repositories from the 陕建数科 platform
  • elasticsearch cluster: 10,000 rows of generated fake data
  • minio: multiple buckets created, each populated with files of different types

Details omitted.

6. Changing All Node IPs

This step is performed in vCenter; details omitted.

No.  Old IP           New IP
1    10.127.91.137    10.127.91.170
2    10.127.91.138    10.127.91.171
3    10.127.91.139    10.127.91.172
4    10.127.91.140    10.127.91.173
5    10.127.91.141    10.127.91.174
6    10.127.91.142    10.127.91.175
7    10.127.91.143    10.127.91.176
8    10.127.91.144    10.127.91.177
9    10.127.91.145    10.127.91.178

7. Recovering Persistent Data

Because the storage class is backed by NFS, the NFS server address is written into each dynamically provisioned PV, and a PV spec cannot be modified in place. Recovery therefore means creating new PVs and PVCs and binding them to the existing data manually.

For a stateless service, even with multiple replicas configured, all replicas share a single volume, so one new PVC is enough.
For a stateful service with multiple replicas, each replica needs its own dedicated volume, so an independent PV and PVC must be created per replica, following the stateful service's claim-naming convention.

Recovery approaches:

  • Method 1: from the original PV, find the backing directory and create a new PV and PVC pointing at it
  • Method 2: from the original PV, find the backing directory, rename it to guard against accidental deletion later, then create the new PV and PVC from the renamed directory
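Either way, the first step is recovering the old PV's `nfs.path` and `server` fields. A minimal sketch of extracting them, assuming the old PV manifests were dumped to YAML files before the IP change (the file written below is a fabricated stand-in for such a dump):

```shell
# Fabricated stand-in for a PV dump taken before the change, e.g. with:
#   kubectl get pv <pv-name> -o yaml > old-pv.yaml
cat > /tmp/old-pv.yaml <<'EOF'
spec:
  nfs:
    path: /data/nfsStorage/public-service-minio-nfs-pvc-pvc-873734a2-5aeb-4c6a-8bef-d70f1547ba89
    server: 10.127.91.143
EOF

# Pull out the fields to copy into the new PV manifest; only the
# server line changes (old NFS IP -> new NFS IP)
grep -E '^[[:space:]]*(path|server):' /tmp/old-pv.yaml
```

The extracted `path` is reused verbatim in the new PV; only `server` is rewritten to the new NFS address.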

7.1 Recovering Stateless Services

Taking minio as an example, create a dedicated PV and PVC with their own storage class:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 50Gi
  nfs:
    path: /data/nfsStorage/public-service-minio-nfs-pvc-pvc-873734a2-5aeb-4c6a-8bef-d70f1547ba89  # use the path from minio's original PV
    server: 10.127.91.176   # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: minio-nfs-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: minio-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: minio-nfs-data

Finally, update minio's Deployment YAML (the original Deployment must be deleted first):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio-nfs
  namespace: public-service
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: minio-nfs
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: minio-nfs
    spec:
      containers:
      - args:
        - gateway
        - nas
        - /shared/nasvol
        - --console-address
        - :9001
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              key: appkey
              name: minio-credentials
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              key: appsecret
              name: minio-credentials
        image: registry.cn-beijing.aliyuncs.com/gldsg-prod/minio:v20220916
        imagePullPolicy: IfNotPresent
        name: minio-nfs
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /shared/nasvol
          name: storage
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: cloud-reg
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: minio-pvc  # change the claim name to the newly created PVC

7.2 Recovering a Single-Replica Stateful Service

Taking the git service as an example:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: git-pv-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-git-data-git-0-pvc-1d7fc16f-f957-471e-8aed-58c117fdf80e  # use the path from git's original PV
    server: 10.127.91.176   # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: git-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: git-data-new-git-0
spec:
  accessModes:
    - ReadWriteMany
  resources: 
    requests:
      storage: 20Gi
  storageClassName: git-data

7.3 Recovering a Multi-Replica Stateful Service

Taking elasticsearch as an example:

# First replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-0-pvc-5a49b9c5-097b-4040-a54e-941ade92b61d  # use the path from this replica's original PV
    server: 10.127.91.176   # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-0
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data

# Second replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-1
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-1-pvc-9155ce12-a6f4-4d89-9b26-7b6f25959d4d  # use the path from this replica's original PV
    server: 10.127.91.176   # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-1
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
# Third replica
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-2
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/public-service-elasticsearch-data-elasticsearch-2-pvc-d3aa037f-6e23-4c6b-971f-61c592be2d24  # use the path from this replica's original PV
    server: 10.127.91.176   # change to the new NFS server address
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: elasticsearch-data-new-elasticsearch-2
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
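The three PV/PVC pairs above differ only in the ordinal, because StatefulSet claims follow the `<template-name>-<pod-name>` pattern. They can be generated in one loop; in this sketch the `REPLACE_WITH_PATH_<i>` strings are placeholders that must be filled in with each replica's real directory, looked up from the original PVs:

```shell
# Generate one PV/PVC pair per elasticsearch replica.
# REPLACE_WITH_PATH_<i> is a placeholder for the real NFS directory
# backing that ordinal's original PV.
for i in 0 1 2; do
cat <<EOF
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: elasticsearch-pv-$i
spec:
  accessModes: [ReadWriteMany]
  capacity:
    storage: 20Gi
  nfs:
    path: /data/nfsStorage/REPLACE_WITH_PATH_$i
    server: 10.127.91.176
  persistentVolumeReclaimPolicy: Delete
  storageClassName: elasticsearch-data
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-data-new-elasticsearch-$i
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 20Gi
  storageClassName: elasticsearch-data
EOF
done > /tmp/es-recovery.yaml

# Review the file, fill in the placeholders, then apply it, e.g.:
#   kubectl apply -f /tmp/es-recovery.yaml
```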

8. Recovering the K8s Cluster

8.1 Changing the First Control-Plane Node's IP

Take particular care with this step:

  • Nodes must be changed one at a time
  • The new IP must be reachable from all other nodes, otherwise the cluster cannot function

8.2 Recovering the First Control-Plane Node

Recovery is performed hot, one node at a time; never recover several nodes simultaneously.

1) Update the static Pod configs for etcd, kube-apiserver, kube-controller-manager, and kube-scheduler

# This step can run on all master nodes at the same time
cd /etc/kubernetes/manifests

sed -i "s/10.127.91.137/10.127.91.170/g" *
sed -i "s/10.127.91.138/10.127.91.171/g" *
sed -i "s/10.127.91.139/10.127.91.172/g" *

2) Update /etc/hosts on all master nodes (can be done in parallel)

sed -i "s/10.127.91.137/10.127.91.170/g" /etc/hosts
sed -i "s/10.127.91.138/10.127.91.171/g" /etc/hosts
sed -i "s/10.127.91.139/10.127.91.172/g" /etc/hosts
sed -i "s/10.127.91.140/10.127.91.173/g" /etc/hosts
sed -i "s/10.127.91.141/10.127.91.174/g" /etc/hosts
sed -i "s/10.127.91.142/10.127.91.175/g" /etc/hosts
sed -i "s/10.127.91.143/10.127.91.176/g" /etc/hosts
sed -i "s/10.127.91.144/10.127.91.177/g" /etc/hosts
sed -i "s/10.127.91.145/10.127.91.178/g" /etc/hosts
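The same nine-entry mapping recurs in several places (hosts, haproxy, manifests), so it can be worth driving every substitution from a single table to avoid copy-paste errors. A sketch, demonstrated on a scratch file rather than the real /etc/hosts (the file contents and hostnames below are fabricated for illustration):

```shell
# Old-to-new IP mapping, taken from the table in section 6
MAP="10.127.91.137:10.127.91.170
10.127.91.138:10.127.91.171
10.127.91.139:10.127.91.172
10.127.91.140:10.127.91.173
10.127.91.141:10.127.91.174
10.127.91.142:10.127.91.175
10.127.91.143:10.127.91.176
10.127.91.144:10.127.91.177
10.127.91.145:10.127.91.178"

# Fabricated demo file; on a real node the target would be /etc/hosts
printf '10.127.91.137 k8s-master-91-137\n10.127.91.145 k8s-node-91-145\n' > /tmp/hosts.demo

# Apply every substitution in one pass over the mapping
printf '%s\n' "$MAP" | while IFS=: read -r old new; do
  sed -i "s/$old/$new/g" /tmp/hosts.demo
done

cat /tmp/hosts.demo
```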

3) Update the haproxy configuration on all master nodes

sed -i "s/10.127.91.137/10.127.91.170/g" /etc/haproxy/haproxy.cfg
sed -i "s/10.127.91.138/10.127.91.171/g" /etc/haproxy/haproxy.cfg
sed -i "s/10.127.91.139/10.127.91.172/g" /etc/haproxy/haproxy.cfg

# restart haproxy after changing its config
systemctl restart haproxy

4) Update the etcd member data

This step is critical: confirm the member-to-node mapping before making any change.

Reference: https://pytimer.github.io/2019/05/change-etcd-cluster-member-ip/

etcdctl binary download: https://github.com/etcd-io/etcd/releases/download/v3.4.3/etcd-v3.4.3-linux-amd64.tar.gz

# This command lists the etcd cluster members; each member's unique ID is
# printed, and that ID is what the member update below operates on.
# Run it on a different master node (any other master).

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
# Output
1bbc5fd9dc1687a, started, k8s-master-91-139, https://10.127.91.139:2380, https://10.127.91.139:2379, false
440629d2b8ac46bc, started, k8s-master-91-137, https://10.127.91.137:2380, https://10.127.91.137:2379, false
47f4d2a7f3d37348, started, k8s-master-91-138, https://10.127.91.138:2380, https://10.127.91.138:2379, false

# The output shows that the first master's peer URL still carries the old IP,
# so it must be updated manually with the command below.
# 440629d2b8ac46bc is the first control-plane node's unique member ID.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    member update 440629d2b8ac46bc --peer-urls=https://10.127.91.170:2380

# Output
Member 440629d2b8ac46bc updated in cluster 55c364da67e7bd4f

# List the etcd members again

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list

# Output
1bbc5fd9dc1687a, started, k8s-master-91-139, https://10.127.91.139:2380, https://10.127.91.139:2379, false
440629d2b8ac46bc, started, k8s-master-91-137, https://10.127.91.170:2380, https://10.127.91.137:2379, false
47f4d2a7f3d37348, started, k8s-master-91-138, https://10.127.91.138:2380, https://10.127.91.138:2379, false

# The peer URL has been updated

After the steps above, one problem remains: the etcd certificates have not been updated. The etcd cluster was not initialized with static addresses, so the node IPs were baked into the certificates. The relevant etcd certificates must therefore be deleted and regenerated.

# Inspect the addresses embedded in a certificate
# (specify the certificate file to inspect)
openssl x509 -noout -text -in *.crt

5) Regenerate certificates

Regenerating the etcd certificates

# On the first control-plane node

cd /etc/kubernetes/pki/etcd

rm -f server.*
rm -f healthcheck-client.*
rm -f peer.*

# Regenerate the certificates. For convenience, regenerate all of them;
# certificates that still exist are not overwritten.

kubeadm init phase certs all

Regenerating the apiserver certificate

cd /etc/kubernetes/pki
rm -f apiserver.*

kubeadm init phase certs apiserver \
--apiserver-advertise-address 10.127.91.170 \
--apiserver-cert-extra-sans ha.api.k8s.gm \
--apiserver-cert-extra-sans 172.17.0.10 \
--apiserver-cert-extra-sans 172.17.0.1 \
--apiserver-cert-extra-sans 127.0.0.1 \
--apiserver-cert-extra-sans kubernetes \
--apiserver-cert-extra-sans kubernetes.default \
--apiserver-cert-extra-sans kubernetes.default.svc \
--apiserver-cert-extra-sans kubernetes.default.svc.cluster.local
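After regeneration it is worth confirming that the new IP really appears in the certificate's SAN list. A self-contained sketch: it fabricates a throwaway certificate carrying the new IP purely for demonstration; on the real node you would run the final `openssl` command against /etc/kubernetes/pki/apiserver.crt instead:

```shell
# Fabricated demo certificate with the new IP in its SANs
# (kubeadm generates the real apiserver.crt; this is illustration only)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:kubernetes,IP:10.127.91.170"

# The actual check: print the SAN extension and look for the new IP
openssl x509 -noout -ext subjectAltName -in /tmp/demo.crt
```

(`-addext` and `-ext` require OpenSSL 1.1.1 or later.)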

After replacing the certificates, restart kubelet and docker.

# In theory restarting kubelet alone is enough; docker is restarted here as well
systemctl restart docker
systemctl restart kubelet

6) Restart the relevant k8s components

docker ps -a -q --filter name=k8s_kube | xargs -r docker rm --force --volumes
docker ps -a -q --filter name=k8s_etcd | xargs -r docker rm --force --volumes

7) Update hosts, haproxy, and similar configs on all worker nodes

sed -i 's/10.127.91.137/10.127.91.170/g' /etc/haproxy/haproxy.cfg
sed -i 's/10.127.91.138/10.127.91.171/g' /etc/haproxy/haproxy.cfg
sed -i 's/10.127.91.139/10.127.91.172/g' /etc/haproxy/haproxy.cfg

sed -i 's/10.127.91.137/10.127.91.170/g' /etc/hosts
sed -i 's/10.127.91.138/10.127.91.171/g' /etc/hosts
sed -i 's/10.127.91.139/10.127.91.172/g' /etc/hosts

The first control-plane node is now done.

8.3 Changing the Second Control-Plane Node's IP

8.4 Updating the Second Master Node

8.5 Changing the Third Control-Plane Node's IP

8.6 Updating the Third Master Node

(Repeat the procedure from 8.1 and 8.2 on each node, substituting that node's old/new IPs and its etcd member ID.)

8.7 Checking the etcd Cluster Status

Once every control-plane node's IP has been changed and each node recovered, check the etcd cluster status:

# ETCDCTL_API=3 etcdctl -w table --endpoints=https://10.127.91.170:2379,https://10.127.91.171:2379,https://10.127.91.172:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint status
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.127.91.170:2379 | 440629d2b8ac46bc |   3.4.3 |  7.4 MB |     false |      false |       779 |      93090 |              93090 |        |
| https://10.127.91.171:2379 | 47f4d2a7f3d37348 |   3.4.3 |  7.3 MB |      true |      false |       779 |      93090 |              93090 |        |
| https://10.127.91.172:2379 |  1bbc5fd9dc1687a |   3.4.3 |  7.4 MB |     false |      false |       779 |      93090 |              93090 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+