Channel: MySQL Forums - NDB clusters

Unexpected shutdown (no replies)

Hi Guys,

I am evaluating MySQL Cluster for use in my company. Out of the blue this morning, NDB shut itself down. It did not restart, and I had to manually restart the 'ndbd' processes after realising that they were not running.

This would be very worrying in a production environment. Could anybody give an opinion on what happened, and how to prevent it in the future?

Version 7.2.5

Two data nodes, 4 and 5.

Node 5 logged this line at the time of the failure:

2012-09-10 10:18:17 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Performing Send elapsed=103

From there, both nodes reported Watchdog warnings, which ended with both nodes shutting down. (Logs included at the end of this post.)

Any advice on what happened, or how to stop it happening again?

Regards, Ben.

----------------------------LOGS---------------------------

Data (Node 5)


2012-09-10 10:18:17 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Performing Send elapsed=103
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2506963 System time: 1647468
2012-09-10 10:18:28 [ndbd] INFO -- timerHandlingLab now: 11355449007 sent: 11355448828 diff: 179
2012-09-10 10:18:28 [ndbd] WARNING -- Time moved forward with 10568 ms
2012-09-10 10:18:28 [ndbd] WARNING -- timerHandlingLab now: 11355459420 sent: 11355449007 diff: 10413
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2506967 System time: 1647473
2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 10544 ms, expected 100 ms.
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2506967 System time: 1647473
2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 578 ms, expected 100 ms.
2012-09-10 10:18:36 [ndbd] INFO -- findNeighbours from: 4861 old (left: 4 right: 4) new (65535 65535)
2012-09-10 10:18:36 [ndbd] INFO -- Arbitrator decided to shutdown this node
2012-09-10 10:18:36 [ndbd] INFO -- QMGR (Line: 5975) 0x00000002
2012-09-10 10:18:36 [ndbd] INFO -- Error handler shutting down system
2012-09-10 10:18:36 [ndbd] INFO -- Error handler shutdown completed - exiting
2012-09-10 10:18:37 [ndbd] ALERT -- Node 5: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.


Data (Node 4)

2012-09-10 10:18:28 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 2206 ms, expected 100 ms.
2012-09-10 10:18:28 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=200
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=300
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=401
2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2998422 System time: 1847367
2012-09-10 10:18:28 [ndbd] INFO -- Arbitrator decided to shutdown this node
2012-09-10 10:18:28 [ndbd] INFO -- QMGR (Line: 5975) 0x00000002
2012-09-10 10:18:28 [ndbd] INFO -- Error handler shutting down system
2012-09-10 10:18:28 [ndbd] INFO -- Error handler shutdown completed - exiting
2012-09-10 10:18:31 [ndbd] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.



Management (Node 1):

2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 4: GCP Monitor: GCP_COMMIT lag 0 seconds (max lag: 13)
2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 4: Node 1 missed heartbeat 2
2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 4: Node 1 missed heartbeat 3
2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 4: Node 5 missed heartbeat 2
2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 4: Node 5 missed heartbeat 3
2012-09-10 10:18:27 [MgmtSrvr] WARNING -- Node 5: Node 1 missed heartbeat 2
2012-09-10 10:18:28 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
2012-09-10 10:18:28 [MgmtSrvr] ALERT -- Node 1: Node 5 Disconnected
2012-09-10 10:18:34 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2012-09-10 10:18:36 [MgmtSrvr] INFO -- Node 1: Node 5 Connected
2012-09-10 10:18:37 [MgmtSrvr] ALERT -- Node 1: Node 5 Disconnected
2012-09-10 10:18:37 [MgmtSrvr] ALERT -- Node 5: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
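
For anyone who wants to compare stall durations across nodes, here is a minimal Python sketch that pulls the Watchdog "overslept" warnings out of log lines in the format shown above. The function names (`parse_line`, `watchdog_stalls`) and the regular expressions are my own illustration based only on the lines in this post, not part of any MySQL tool, and the log format may differ in other versions.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 10544 ms, expected 100 ms.
# The layout is inferred from the log excerpts in this post (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>\w+)\] (?P<level>\w+)\s+--\s+(?P<msg>.*)$"
)
OVERSLEPT_RE = re.compile(r"overslept (\d+) ms, expected (\d+) ms")


def parse_line(line):
    """Split one log line into timestamp, source, level and message.

    Returns None for lines that do not match the assumed layout.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "source": m.group("source"),
        "level": m.group("level"),
        "msg": m.group("msg"),
    }


def watchdog_stalls(lines):
    """Return (timestamp, overslept_ms, expected_ms) for each overslept warning."""
    stalls = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = OVERSLEPT_RE.search(rec["msg"])
        if m:
            stalls.append((rec["ts"], int(m.group(1)), int(m.group(2))))
    return stalls


# Three lines copied from the Node 5 log above.
SAMPLE = [
    "2012-09-10 10:18:28 [ndbd] INFO -- Watchdog: User time: 2506967 System time: 1647473",
    "2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 10544 ms, expected 100 ms.",
    "2012-09-10 10:18:28 [ndbd] WARNING -- Watchdog: Warning overslept 578 ms, expected 100 ms.",
]

if __name__ == "__main__":
    for ts, overslept, expected in watchdog_stalls(SAMPLE):
        print(ts.isoformat(sep=" "), "overslept", overslept, "ms; expected", expected, "ms")
```

To run it against a real file, pass the open file object to `watchdog_stalls`, since it accepts any iterable of lines.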
