What happens when we loose ViPR nodes?

ViPR failureI decided to check what ViPR admins will see in the case of one or two nodes fail.

My config: 2+1 ViPR v2.1 P1

  • ViPR is up and running.

vipr-failure-10

  • Let’s shutdown the vipr3 VM.

vipr-failure-2

  • The status of ViPR is Degraded now.

vipr-failure-3

  • We can see “Could not connect to server” Error in the geosvc logs.

vipr-failure-4

  • Let’s shutdown the second vipr2 VM.

vipr-failure-5

  • ViPR is unavailable via browser now.

vipr-failure-6

  • Let’s connect to the remaining vipr1 VM and check some logs.
myhost$ ssh root@vipr1
You are accessing ViPR. By using this system you consent to the owning organization's terms and conditions.
Password:
Last login: Fri Nov 28 13:40:46 2014 from myhost
Welcome to vipr-2.1.0.0.384

vipr1:~ # cat /opt/storageos/logs/warn
...
Dec 3 09:49:52 localhost Request took an unusually long time to complete: 195,315ms
Dec 3 09:49:52 localhost Number of concurrent requests is high: 10. Avg response time recently is: 0ms
...

vipr1:~ # cat /opt/storageos/logs/dbsvc.log
...
2014-12-03 09:49:51,493 [main-SendThread(vipr2:2181)] WARN ClientCnxn.java (line 1089) Session 0x349a32540860001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
...

vipr1:~ # cat /opt/storageos/logs/authsvc.log
...
2014-12-03 09:49:32,749 [ProviderConfigUpdater] ERROR CustomAuthenticationManager.java (line 458) Exception loading authentication provider configuration from db, will a retry in 2 secs
com.emc.storageos.db.exceptions.RetryableDatabaseException: Database connection failed
at com.emc.storageos.db.client.constraint.impl.QueryHitIterator.runQuery(QueryHitIterator.java:45)
at com.emc.storageos.db.client.constraint.impl.FilteredQueryHitIterator.runQuery(FilteredQueryHitIterator.java:31)
...
Caused by: UnavailableException()
at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11645)
at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11603)
at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:11529)
...

vipr1:~ # cat /opt/storageos/logs/syssvc.log
...
2014-12-03 09:48:02,815 [RingDescribeAutoDiscovery] WARN HostSupplierImpl.java (line 102) no dbsvc instance running. Coordinator exception message: Error occurred while finding the node with path /service/geodbsvc/2.1
2014-12-03 09:48:02,815 [RingDescribeAutoDiscovery] WARN HostSupplierImpl.java (line 69) hostsupplier is empty. May be dbsvc hasn't started yet. waiting for 10000 msec
2014-12-03 09:48:02,816 [RingDescribeAutoDiscovery] WARN HostSupplierImpl.java (line 102) no dbsvc instance running. Coordinator exception message: Error occurred while finding the node with path /service/dbsvc/2.1
...

vipr1:~ # cat /opt/storageos/logs/coordinatorsvc.log
...
2014-12-03 09:49:33,004 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN NIOServerCnxn.java (line 354) Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
...
  • Let’s check Cassandra status.
vipr1:~ # /opt/storageos/bin/nodetool --host localhost ring
...
==========
Address Rack Status State Load Owns Token
9104406945440735131
10.10.10.163 rack1 Down Normal 5.77 MB 34.86% -9217526110072942982
10.10.10.161 rack1 Up Normal 5.66 MB 29.98% -9208447921831706602
10.10.10.162 rack1 Down Normal 5.96 MB 35.15% -9191728173294922023
  • Let’s check Zookeeper status.
vipr1:~ # /opt/storageos/bin/zkCli.sh
 ...
 [zk: localhost:2181(CONNECTING) 0] 2014/12/03 11:24:42.962 INFO main-SendThread(localhost:2181) [ClientCnxn] Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
 2014/12/03 11:24:42.962 INFO main-SendThread(localhost:2181) [ClientCnxn] Socket connection established to localhost/127.0.0.1:2181, initiating session
 2014/12/03 11:24:42.963 INFO main-SendThread(localhost:2181) [ClientCnxn] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
  • Telnet to the port 2181 and next type “ruok” and “stat”
vipr1:~ # telnet 10.10.10.161 2181
 Trying 10.10.10.161...
 Connected to 10.10.10.161.
 Escape character is '^]'.
 ruok
 imok
 Connection closed by foreign host.
 vipr1:~ # telnet 10.10.10.161 2181
 Trying 10.10.10.161...
 Connected 10.10.10.161.
 Escape character is '^]'.
 stat
 This ZooKeeper instance is not currently serving requests
 Connection closed by foreign host.

Status imok means that Zookeeper is running, but it doesn’t serve any connections.

  • Let’s return back the vipr2 VM.

vipr-failure-9

  • After about 7 minutes ViPR automatically returns back. No needs in any manual actions. The ViPR status is Degraded now.

vipr-failure-3

  • Let’s check the Zookeeper status now
vipr1:~ # /opt/storageos/bin/zkCli.sh
 ...
 Welcome to ZooKeeper!
 2014/12/03 11:39:06.990 INFO main-SendThread(localhost:2181) [ClientCnxn] Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
 2014/12/03 11:39:06.998 INFO main-SendThread(localhost:2181) [ClientCnxn] Socket connection established to localhost/127.0.0.1:2181, initiating session
 JLine support is enabled
 2014/12/03 11:39:07.013 INFO main-SendThread(localhost:2181) [ClientCnxn] Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a0fef7557000b, negotiated timeout = 30000
 WATCHER::
 WatchedEvent state:SyncConnected type:None path:null
 [zk: localhost:2181(CONNECTED) 0] quit ... please don't experiment with any commands here !

vipr1:~ # telnet 10.76.246.161 2181
 Trying 10.76.246.161...
 Connected to 10.76.246.161.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.4.5--1, built on 03/11/2014 02:05 GMT
 Clients:
 /10.10.10.161:36749[1](queued=0,recved=11616,sent=11616)
 /10.10.10.161:36932[1](queued=0,recved=10435,sent=10435)
 /10.10.10.161:36828[1](queued=0,recved=10797,sent=10858)
 /10.10.10.162:59295[1](queued=0,recved=41,sent=41)
 ...
  • Let’s check the Cassandra status.
vipr1:~ # /opt/storageos/bin/nodetool --host localhost ring
 ...
 ==========
 Address Rack Status State Load Owns Token
 9104406945440735131
 10.76.246.163 rack1 Down Normal 5.77 MB 34.86% -9217526110072942982
 10.76.246.161 rack1 Up Normal 5.62 MB 29.98% -9208447921831706602
 10.76.246.162 rack1 Up Normal 7.14 MB 35.15% -9191728173294922023
 ...
  • Let’s return the last vipr3 VM.

vipr-failure-7

  • ViPR is in the “Stable” state now.

vipr-failure-1

  • Zookeeper is OK
 vipr1:~ # telnet 10.10.10.161 2181
 Trying 10.76.246.161...
 Connected to 10.76.246.161.
 Escape character is '^]'.
 stat
 Zookeeper version: 3.4.5--1, built on 03/11/2014 02:05 GMT
 Clients:
 /10.76.246.163:44285[1](queued=0,recved=12465,sent=12465)
 /10.76.246.163:44461[1](queued=0,recved=346,sent=346)
 /10.76.246.161:36749[1](queued=0,recved=21864,sent=21864)
 /10.76.246.161:36932[1](queued=0,recved=23130,sent=23130)
 /10.76.246.161:36828[1](queued=0,recved=21091,sent=21213)
 /10.76.246.163:44336[1](queued=0,recved=336,sent=336)
 /10.76.246.162:59295[1](queued=0,recved=399,sent=399)
...
  •  Cassandra DB is OK as well.
vipr1:~ # /opt/storageos/bin/nodetool --host localhost ring
 …
 ==========
 Address Rack Status State Load Owns Token
 9104406945440735131
 10.76.246.163 rack1 Up Normal 7.01 MB 34.86% -9217526110072942982
 10.76.246.161 rack1 Up Normal 5.61 MB 29.98% -9208447921831706602
 10.76.246.162 rack1 Up Normal 7.14 MB 35.15% -9191728173294922023
 ...

Added…

  • Data Base is repaired for some time. To check that:
vipr1:~ # cat dbsvc.log|grep repair
 ...
 2014-12-03 19:48:51,601 [RepairJobTask:1] INFO RepairJob.java (line 134) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] requesting merkle trees for StoragePoolCapabilities (to [/10.76.246.162, /10.76.246.163, /10.76.246.161])
 2014-12-03 19:48:51,634 [AntiEntropyStage:1] INFO RepairSession.java (line 164) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Received merkle tree for StoragePoolCapabilities from /10.76.246.162
 2014-12-03 19:48:51,679 [AntiEntropyStage:1] INFO RepairSession.java (line 164) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Received merkle tree for StoragePoolCapabilities from /10.76.246.163
 2014-12-03 19:48:51,713 [AntiEntropyStage:1] INFO RepairSession.java (line 164) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Received merkle tree for StoragePoolCapabilities from /10.76.246.161
 2014-12-03 19:48:51,716 [RepairJobTask:1] INFO Differencer.java (line 67) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Endpoints /10.76.246.162 and /10.76.246.163 are consistent for StoragePoolCapabilities
 2014-12-03 19:48:51,718 [RepairJobTask:6] INFO Differencer.java (line 67) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Endpoints /10.76.246.162 and /10.76.246.161 are consistent for StoragePoolCapabilities
 2014-12-03 19:48:51,718 [RepairJobTask:4] INFO Differencer.java (line 67) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] Endpoints /10.76.246.163 and /10.76.246.161 are consistent for StoragePoolCapabilities
 2014-12-03 19:48:51,719 [AntiEntropyStage:1] INFO RepairSession.java (line 221) [repair #66949f20-7b25-11e4-8d63-4f8c453456f0] StoragePoolCapabilities is fully synced
 ...
  • Finally you should see that DB repair is finished.
vipr1:~ # cat dbsvc.log|grep Ending
 2014-12-03 20:20:11,775 [DBRepairPool_293] INFO NodeRepairExecutor.java (line 61) Ending repair

Added…

Instead of /opt/storageos/bin/nodetool –host localhost ring
just /opt/storageos/bin/nodetool status can be used

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s