Redis Cluster Configuration

- 3 master nodes, each with 1 slave: 3 masters and 3 slaves in total

- redis.conf settings

port 6379
cluster-require-full-coverage yes
cluster-enabled yes
cluster-config-file /data/nodes.conf
cluster-node-timeout 5000 # node is flagged as failed roughly 5-6 seconds after it stops responding
appendonly yes
appendfilename appendonly.aof
dbfilename dump.rdb
logfile /data/log.log

 

When a master node is killed

The slave is promoted to master; when the old master restarts, it rejoins as a slave. Logs from the promoted slave:

1:S 28 Jan 05:06:51.393 # Connection with master lost.
1:S 28 Jan 05:06:51.393 * Caching the disconnected master state.
1:S 28 Jan 05:06:51.672 * Connecting to MASTER 10.244.0.196:6379
1:S 28 Jan 05:06:51.673 * MASTER <-> SLAVE sync started
1:S 28 Jan 05:06:57.068 * FAIL message received from 1819c808fc29701676fde27691b5f57de297dfd8 about f1124d4c5e1bae7eae1107bc4fc2d9543132aeb3
1:S 28 Jan 05:06:57.069 # Cluster state changed: fail
1:S 28 Jan 05:06:57.092 # Start of election delayed for 916 milliseconds (rank #0, offset 720).
1:S 28 Jan 05:06:58.095 # Starting a failover election for epoch 7.
1:S 28 Jan 05:06:58.099 # Failover election won: I'm the new master.
1:S 28 Jan 05:06:58.099 # configEpoch set to 7 after successful failover
1:M 28 Jan 05:06:58.099 # Setting secondary replication ID to 951fb8aac85409a8fc5dad74b059262134613f5d, valid up to offset: 721. New replication ID is dc7af59ee82d9622a4097f458eab4df13a8e8b15
1:M 28 Jan 05:06:58.099 * Discarding previously cached master state.
1:M 28 Jan 05:06:58.100 # Cluster state changed: ok
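The 916 ms "Start of election delayed" line in the log matches the delay formula from the Redis Cluster specification: a fixed 500 ms, plus a random 0-500 ms, plus 1000 ms per replica rank (rank 0 is the replica with the most up-to-date replication offset). A small sketch of that computation:

```java
import java.util.concurrent.ThreadLocalRandom;

public class ElectionDelay {

    // Per the Redis Cluster spec, a replica delays its failover election by:
    // 500 ms fixed + random 0..500 ms + rank * 1000 ms.
    // Rank 0 is the replica with the most up-to-date replication offset.
    public static long delayMillis(int rank) {
        return 500 + ThreadLocalRandom.current().nextLong(501) + rank * 1000L;
    }

    public static void main(String[] args) {
        // A rank-0 replica starts its election 500-1000 ms after the master
        // is flagged FAIL, consistent with the 916 ms seen in the log above.
        System.out.println(delayMillis(0));
    }
}
```

The rank term staggers elections so that the replica with the freshest data usually wins.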

 

Logs from another master node

1:M 28 Jan 05:06:57.068 * Marking node f1124d4c5e1bae7eae1107bc4fc2d9543132aeb3 as failing (quorum reached).
1:M 28 Jan 05:06:57.068 # Cluster state changed: fail
1:M 28 Jan 05:06:58.097 # Failover auth granted to 8c67f0954711e074afd1fb34ebb7cc24c6cc73e0 for epoch 7
1:M 28 Jan 05:06:58.102 # Cluster state changed: ok

 

Logs from another slave node

1:S 28 Jan 05:06:57.068 * FAIL message received from 1819c808fc29701676fde27691b5f57de297dfd8 about f1124d4c5e1bae7eae1107bc4fc2d9543132aeb3
1:S 28 Jan 05:06:57.068 # Cluster state changed: fail
1:S 28 Jan 05:06:58.102 # Cluster state changed: ok

 

Lettuce client logs (Spring Boot) when the master node is killed

2022-01-27 05:01:02.909 DEBUG 1 --- [llEventLoop-4-1] i.l.core.protocol.ConnectionWatchdog     : Cannot reconnect to [10.244.3.34:6379]: finishConnect(..) failed: Host is unreachable: /10.244.3.34:6379

io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Host is unreachable: /10.244.3.34:6379
Caused by: java.net.ConnectException: finishConnect(..) failed: Host is unreachable
	at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.unix.Socket.finishConnect(Socket.java:251) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

 

If the topology view is not refreshed, the client cannot reconnect even after the Redis cluster itself has recovered.

 

Topology refresh must therefore always be enabled in the client (Lettuce) configuration.
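With plain Lettuce, that refresh is enabled via ClusterTopologyRefreshOptions. The sketch below is a minimal example; the 30-second period and the seed address passed by the caller are placeholder values, not taken from this cluster:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class ClusterClientFactory {

    public static RedisClusterClient create(String seedHost, int seedPort) {
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                // Re-fetch CLUSTER NODES periodically so promoted slaves are discovered.
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                // Also refresh immediately on MOVED/ASK redirects and
                // persistent reconnect failures, instead of waiting for the timer.
                .enableAllAdaptiveRefreshTriggers()
                .build();

        RedisClusterClient client = RedisClusterClient.create(RedisURI.create(seedHost, seedPort));
        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .build());
        return client;
    }
}
```

In Spring Boot, the same ClusterClientOptions can be supplied through LettuceClientConfiguration.builder().clientOptions(...) when building the LettuceConnectionFactory.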

 

 

Note that killing a slave node has no effect on either the client or the cluster.

 

When the former master node restarts

The former slave has already been promoted to master, and the restarted node becomes a slave of that new master.

 

Logs from the other master/slave nodes

1:M 28 Jan 05:10:12.299 # Address updated for node f1124d4c5e1bae7eae1107bc4fc2d9543132aeb3, now 10.244.0.208:6379
1:M 28 Jan 05:10:12.370 * Clear FAIL state for node f1124d4c5e1bae7eae1107bc4fc2d9543132aeb3: master without slots is reachable again.

 


