Kubernetes Troubleshooting

espossible 2021. 9. 11. 14:37

Let's look at failure scenarios that commonly occur in Kubernetes and how to resolve them.

- Application Failure

- Control Plane Failure

- Worker Node Failure

- Network Failure

Case 1) Application Failure

1. The Service's targetPort is wrong

2. The Service name is wrong

3. The configuration is wrong (for example, the database password)

4. The Service's selector is wrong

What to do

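# Check whether users can reach the service
curl http://web-service-ip:port

# Check whether the service has endpoints assigned
kubectl get ep

# Check whether the configuration is correct
kubectl describe po <pod_name>
kubectl logs -f <pod_name>
kubectl logs <pod_name> --previous   # logs of the previous container instance, e.g. after a crash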
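
An empty endpoints list usually means the Service's selector matches no ready pods (cause 4 above). The following is a minimal sketch of how that check could be automated. It assumes the default tabular output of kubectl get ep, where a Service without endpoints shows <none> in the ENDPOINTS column. The function name services_without_endpoints is hypothetical.

```bash
# Print the names of Services whose ENDPOINTS column is <none>.
# Reads the tabular output of `kubectl get ep` on stdin.
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get ep | services_without_endpoints
```

For each Service it prints, comparing the selector from kubectl describe svc with the pod labels from kubectl get po --show-labels is a reasonable next step.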

Case 2) Control Plane Failure

Source: https://v1-18.docs.kubernetes.io/docs/concepts/overview/components/

What to do

# Check the status of the control plane pods
kubectl get po -n kube-system

# Check the control plane pod logs (k is an alias for kubectl)
k get po -n kube-system | grep control
k logs -f <pod_name> -n kube-system

etcd-kind-cluster-control-plane                       1/1     Running   0          28m
etcd-kind-cluster-control-plane2                      1/1     Running   0          27m
etcd-kind-cluster-control-plane3                      1/1     Running   0          27m
kube-apiserver-kind-cluster-control-plane             1/1     Running   0          28m
kube-apiserver-kind-cluster-control-plane2            1/1     Running   0          27m
kube-apiserver-kind-cluster-control-plane3            1/1     Running   1          27m
kube-controller-manager-kind-cluster-control-plane    1/1     Running   1          28m
kube-controller-manager-kind-cluster-control-plane2   1/1     Running   0          27m
kube-controller-manager-kind-cluster-control-plane3   1/1     Running   0          27m
kube-scheduler-kind-cluster-control-plane             1/1     Running   1          28m
kube-scheduler-kind-cluster-control-plane2            1/1     Running   0          26m
kube-scheduler-kind-cluster-control-plane3            1/1     Running   0          27m

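If a required component does not appear in the pod list of the kube-system namespace, check its manifest file.

Some components run as static pods, so their pod names are suffixed with the name of the node they run on (for example, kube-apiserver-kind-cluster-control-plane).

Static pod manifests are usually located in /etc/kubernetes/manifests. Open the manifest and check whether anything is misconfigured.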

root@controlplane:/etc/kubernetes/manifests# ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
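
The manifest check can be scripted. This is a minimal sketch that assumes the four file names from the listing above and the /etc/kubernetes/manifests directory; other setups may use different names or paths. The function name missing_manifests is hypothetical.

```bash
# Print the name of each expected static pod manifest that is missing
# from the given directory (default: /etc/kubernetes/manifests).
missing_manifests() {
  dir="${1:-/etc/kubernetes/manifests}"
  for f in etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml; do
    [ -f "$dir/$f" ] || echo "$f"
  done
}

# Usage (run on a control plane node):
#   missing_manifests
```

No output means all four manifests are present. It does not mean their contents are valid.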

For reference, if a control plane component runs as a systemd service rather than as a static pod, check its logs with journalctl.

sudo journalctl -u kube-apiserver

Case 3) Worker Node Failure

- kubelet

- kube-proxy

What to do

# List the worker nodes
k get no | grep worker

kind-cluster-worker           Ready    <none>   40m   v1.18.2
kind-cluster-worker2          Ready    <none>   40m   v1.18.2
kind-cluster-worker3          Ready    <none>   40m   v1.18.2

# Describe a worker node
k describe no kind-cluster-worker

Name:               kind-cluster-worker
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-cluster-worker
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 11 Sep 2021 13:33:52 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  kind-cluster-worker
  AcquireTime:     <unset>
  RenewTime:       Sat, 11 Sep 2021 14:15:11 +0900
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 11 Sep 2021 14:14:10 +0900   Sat, 11 Sep 2021 13:33:52 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 11 Sep 2021 14:14:10 +0900   Sat, 11 Sep 2021 13:33:52 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 11 Sep 2021 14:14:10 +0900   Sat, 11 Sep 2021 13:33:52 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 11 Sep 2021 14:14:10 +0900   Sat, 11 Sep 2021 13:34:12 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.18.0.5
  Hostname:    kind-cluster-worker
Capacity:
  cpu:                6
  ephemeral-storage:  61255492Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2034968Ki
  pods:               110
Allocatable:
  cpu:                6
  ephemeral-storage:  61255492Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2034968Ki
  pods:               110
System Info:
  Machine ID:                 ec42c7f79c1042d7ab10c3d1374cce50
  System UUID:                28ec7417-523f-4269-b855-0868e56b2a17
  Boot ID:                    fbeca5f2-a9ad-45b8-a6a5-f50b368b90f9
  Kernel Version:             5.10.25-linuxkit
  OS Image:                   Ubuntu 20.04 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.3.3-14-g449e9269
  Kubelet Version:            v1.18.2
  Kube-Proxy Version:         v1.18.2
PodCIDR:                      10.244.4.0/24
PodCIDRs:                     10.244.4.0/24
ProviderID:                   kind://docker/kind-cluster/kind-cluster-worker
Non-terminated Pods:          (2 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  kube-system                 kindnet-8mspw       100m (1%)     100m (1%)   50Mi (2%)        50Mi (2%)      41m
  kube-system                 kube-proxy-rb6tg    0 (0%)        0 (0%)      0 (0%)           0 (0%)         41m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (1%)  100m (1%)
  memory             50Mi (2%)  50Mi (2%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type    Reason                   Age                From        Message
  ----    ------                   ----               ----        -------
  Normal  NodeHasSufficientMemory  41m (x8 over 41m)  kubelet     Node kind-cluster-worker status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    41m (x8 over 41m)  kubelet     Node kind-cluster-worker status is now: NodeHasNoDiskPressure
  Normal  Starting                 41m                kube-proxy  Starting kube-proxy.

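Describing a worker node shows a Conditions section, which reports the MemoryPressure, DiskPressure, PIDPressure, and Ready conditions (older Kubernetes versions also reported OutOfDisk, which does not appear in the v1.18.2 output above).

For example, MemoryPressure being True means the node is running low on memory.

You can log in to each node and inspect its state in more detail with tools such as top and df -h.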
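
Reading the Conditions table by eye gets tedious with many nodes, so here is a minimal sketch that filters it. It assumes the space-indented layout of kubectl describe no shown above. The function name unhealthy_conditions is hypothetical.

```bash
# Read `kubectl describe no <node>` output on stdin and print every
# condition that signals trouble: a *Pressure condition that is True,
# or a Ready condition that is anything other than True.
unhealthy_conditions() {
  awk '
    /^Conditions:/     { in_cond = 1; next }
    in_cond && /^[^ ]/ { in_cond = 0 }   # next top-level section ends the table
    in_cond && $1 ~ /Pressure$/ && $2 == "True" { print $1 "=" $2 }
    in_cond && $1 == "Ready"    && $2 != "True" { print $1 "=" $2 }
  '
}

# Usage:
#   kubectl describe no kind-cluster-worker | unhealthy_conditions
```

For the healthy node shown above, this prints nothing.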

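Occasionally a condition's status is Unknown. This means the control plane has stopped receiving status updates from the node, so suspect the kubelet.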

service kubelet status
sudo journalctl -u kubelet      # add -f to follow the log

The kubelet configuration file is usually /var/lib/kubelet/config.yaml (on kubeadm-based clusters), so open it and check whether anything is misconfigured.

After making changes, run systemctl daemon-reload and then systemctl restart kubelet to apply them.

Alternatively, the kubelet's kubeconfig file may be wrong.

It is usually located at /etc/kubernetes/kubelet.conf, so check it for incorrect values.

For example, the kube-apiserver port may be set incorrectly.
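
As a quick way to check the port, the sketch below extracts it from the server: entry of a kubeconfig file. It assumes the entry has the form server: https://<host>:<port>. The function name apiserver_port is hypothetical, and 6443 in the usage comment is the kube-apiserver default, which a given cluster may override.

```bash
# Print the port from the first `server:` entry of a kubeconfig file.
apiserver_port() {
  sed -n 's|^ *server: *https://.*:\([0-9][0-9]*\)/*$|\1|p' "$1" | head -n 1
}

# Usage (run on the worker node; 6443 is the default API server port):
#   port=$(apiserver_port /etc/kubernetes/kubelet.conf)
#   [ "$port" = "6443" ] || echo "unexpected API server port: $port"
```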

Note) Checking the kubelet certificate

# Verify that the issuer is the right CA, the subject has the right group, and so on
openssl x509 -in /var/lib/kubelet/worker-1.crt -text -noout

Certificate:
    Data: xxx
        Signature Algorithm: xxx
        Issuer: xxx
        Validity
            Not Before: xxx
            Not After : xxx
        Subject: xxx
        ...

Case 4) Network Failure

- CNI: check that cni-bin-dir and network-plugin are set correctly in the kubelet configuration.

- DNS

- Proxy

What to do for DNS failure

kubectl edit cm coredns -n kube-system

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2021-09-11T04:32:04Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "247"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 2f2d4f33-0ef1-4e37-b215-2e4034aac6f7

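Port 53 is used for DNS resolution.

CoreDNS problems fall into three categories.

First, the coredns pods are Pending: check that the network plugin is installed correctly.

Second, the coredns pods are in CrashLoopBackOff or an error state: this may be caused by the installed OS and Docker versions or by a permissions problem.

Third, all coredns pods are healthy: check the Service's endpoints, selector, and ports.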

kubectl get ep kube-dns -n kube-system
NAME       ENDPOINTS                                                 AGE
kube-dns   10.244.0.3:53,10.244.0.4:53,10.244.0.3:9153 + 3 more...   72m
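
The three cases above can be sketched as a small triage helper. It assumes the default tabular output of kubectl get po and that the CoreDNS pod names start with coredns. The function name triage_coredns is hypothetical, and the hints only restate the checks listed above.

```bash
# Read `kubectl get po -n kube-system` rows on stdin and print, for each
# coredns pod, the follow-up check suggested by its STATUS column.
triage_coredns() {
  awk '$1 ~ /^coredns/ {
    if ($3 == "Pending") {
      hint = "check that the network plugin is installed"
    } else if ($3 == "CrashLoopBackOff" || $3 == "Error") {
      hint = "check OS/Docker versions and permissions"
    } else if ($3 == "Running") {
      hint = "check the kube-dns endpoints, selector, and ports"
    } else {
      hint = "unrecognized status, inspect with kubectl describe"
    }
    print $1 ": " hint
  }'
}

# Usage:
#   kubectl get po -n kube-system | triage_coredns
```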

What to do for Proxy failure

kube-proxy watches Services and the endpoints associated with each Service. When a client connects to a Service through its virtual IP, kube-proxy is responsible for routing the traffic to the actual pods.

kube-proxy is deployed as a DaemonSet.

k describe ds kube-proxy -n kube-system

Name:           kube-proxy
Selector:       k8s-app=kube-proxy
Node-Selector:  kubernetes.io/os=linux
Labels:         k8s-app=kube-proxy
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 6
Current Number of Nodes Scheduled: 6
Number of Nodes Scheduled with Up-to-date Pods: 6
Number of Nodes Scheduled with Available Pods: 6
Number of Nodes Misscheduled: 0
Pods Status:  6 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           k8s-app=kube-proxy
  Service Account:  kube-proxy
  Containers:
   kube-proxy:
    Image:      k8s.gcr.io/kube-proxy:v1.18.2
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/kube-proxy
      --config=/var/lib/kube-proxy/config.conf
      --hostname-override=$(NODE_NAME)
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/kube-proxy from kube-proxy (rw)
  Volumes:
   kube-proxy:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-proxy
    Optional:  false
   xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
   lib-modules:
    Type:               HostPath (bare host directory volume)
    Path:               /lib/modules
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:                 <none>
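
To spot a partially available DaemonSet, the desired and available counts in the output above can be compared. This is a minimal sketch that assumes the field labels shown in that output. The function name ds_fully_available is hypothetical.

```bash
# Read `kubectl describe ds kube-proxy -n kube-system` output on stdin.
# Prints both counts; exits 0 when every desired pod is available, 1 otherwise.
ds_fully_available() {
  awk -F': *' '
    /^Desired Number of Nodes Scheduled:/             { desired = $2 }
    /^Number of Nodes Scheduled with Available Pods:/ { available = $2 }
    END {
      printf "desired=%s available=%s\n", desired, available
      exit ((desired != "" && desired == available) ? 0 : 1)
    }
  '
}

# Usage:
#   k describe ds kube-proxy -n kube-system | ds_fully_available \
#     || echo "some kube-proxy pods are not available"
```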

The kube-proxy pods may be unhealthy, or their logs may reveal a problem.

k get po -n kube-system | grep kube-proxy
k logs <pod_name> -n kube-system

kube-proxy-4g89x                                      1/1     Running   0          83m
kube-proxy-6tp9m                                      1/1     Running   0          85m
kube-proxy-lvvh5                                      1/1     Running   0          84m
kube-proxy-r5vqz                                      1/1     Running   0          83m
kube-proxy-rb6tg                                      1/1     Running   0          83m
kube-proxy-t58hp                                      1/1     Running   0          83m

Alternatively, kube-proxy may not be running properly inside the container.

# netstat -plan | grep kube-proxy
tcp        0      0 0.0.0.0:30081           0.0.0.0:*               LISTEN      1/kube-proxy
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      1/kube-proxy
tcp        0      0 172.17.0.12:33706       172.17.0.12:6443        ESTABLISHED 1/kube-proxy
tcp6       0      0 :::10256                :::*

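In that case, check that the kube-proxy DaemonSet definition is correct. In particular, check that the kube-proxy binary is specified correctly in the container command (/usr/local/bin/kube-proxy in the output above).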

Sources

- https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

- https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

- https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/

- https://kind.sigs.k8s.io/docs/user/quick-start/

- https://v1-18.docs.kubernetes.io/docs/concepts/overview/components/