I am running a MySQL Cluster. The version and node details are listed below:
======================================================================
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=10 @172.16.7.52 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 0, Master)
id=11 @172.16.7.53 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 0)
id=12 @61.56.8.154 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 1)
id=13 @61.56.8.154 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @172.17.7.51 (mysql-5.1.51 ndb-7.2.0)
[mysqld(API)] 4 node(s)
id=20 @172.16.7.52 (mysql-5.1.51 ndb-7.2.0)
id=21 @172.16.7.53 (mysql-5.1.51 ndb-7.2.0)
id=22 @61.56.8.154 (mysql-5.1.51 ndb-7.2.0)
id=23 @61.56.8.155 (mysql-5.1.51 ndb-7.2.0)
=======================================================================
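(For reference, the status above is what the ndb_mgm management client prints; I capture it on the management host with something like the command below, assuming the management server listens on the default port 1186.)
=============================================================================
# Run on the management host; -c gives the connect string, -e runs a single client command
ndb_mgm -c localhost:1186 -e "SHOW"
=============================================================================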
My problem is that one of the four data nodes always gets shut down because of some kind of connection failure. After that, the cluster runs stably with the three remaining data nodes. It looks like:
========================================================================
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=10 @172.16.7.52 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 0, Master)
id=11 @172.16.7.53 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 0)
id=12 @61.56.8.154 (mysql-5.1.51 ndb-7.2.0, Nodegroup: 1)
id=13 (not connected, accepting connect from ndb-node4)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @172.17.7.51 (mysql-5.1.51 ndb-7.2.0)
[mysqld(API)] 4 node(s)
id=20 @172.16.7.52 (mysql-5.1.51 ndb-7.2.0)
id=21 @172.16.7.53 (mysql-5.1.51 ndb-7.2.0)
id=22 @61.56.8.154 (mysql-5.1.51 ndb-7.2.0)
id=23 @61.56.8.155 (mysql-5.1.51 ndb-7.2.0)
=========================================================================
After I restart the dead data node, the cluster runs normally with all four data nodes for 2~3 hours, and then it happens again. By the way, the dead data node is not always the same one.
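(For reference, I bring the dead node back with roughly the command below, run on the host of the failed node; the connect string pointing at ndb-manager:1186 is an assumption based on my config.ini.)
=============================================================================
# Run on the failed data node's host (e.g. ndb-node4 for node id 13);
# the connect string assumes the management server at ndb-manager, port 1186
ndbd --ndb-connectstring=ndb-manager:1186
=============================================================================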
I run all cluster nodes on VMs; the OS is RHEL 5.5. Each data node is assigned 8 GB of memory and 4 GB of swap. I also checked the memory usage on the data nodes while ndbd is running; the average memory usage is about 2.4 GB.
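(Roughly how I check memory usage: OS-level memory and swap with free on the data node, and NDB DataMemory/IndexMemory via the management client's REPORT command; the exact figures and output format may differ on your setup.)
=============================================================================
# On a data node: overall memory and swap usage in MB
free -m

# On the management host: DataMemory / IndexMemory usage of all data nodes
ndb_mgm -c localhost:1186 -e "ALL REPORT MemoryUsage"
=============================================================================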
The log on the management node looks like:
=========================================================================
2012-03-06 17:12:59 [MgmtSrvr] ALERT -- Node 11: Node 13 Disconnected
2012-03-06 17:12:59 [MgmtSrvr] INFO -- Node 10: Communication to Node 13 closed
2012-03-06 17:12:59 [MgmtSrvr] INFO -- Node 11: Communication to Node 13 closed
2012-03-06 17:12:59 [MgmtSrvr] INFO -- Node 12: Communication to Node 13 closed
2012-03-06 17:12:59 [MgmtSrvr] ALERT -- Node 1: Node 13 Disconnected
2012-03-06 17:12:59 [MgmtSrvr] ALERT -- Node 10: Arbitration check won - node group majority
2012-03-06 17:12:59 [MgmtSrvr] INFO -- Node 10: President restarts arbitration thread [state=6]
2012-03-06 17:12:59 [MgmtSrvr] ALERT -- Node 10: Node 13 Disconnected
2012-03-06 17:12:59 [MgmtSrvr] ALERT -- Node 12: Node 13 Disconnected
2012-03-06 17:13:00 [MgmtSrvr] INFO -- Node 10: Local checkpoint 125 completed
2012-03-06 17:13:00 [MgmtSrvr] ALERT -- Node 13: Forced node shutdown completed. Caused by error 2315: 'Node declared dead. See error log for details(Arbitration error). Temporary error, restart node'.
2012-03-06 17:13:20 [MgmtSrvr] INFO -- Node 12: Communication to Node 13 opened
2012-03-06 17:13:20 [MgmtSrvr] INFO -- Node 10: Communication to Node 13 opened
2012-03-06 17:13:20 [MgmtSrvr] INFO -- Node 11: Communication to Node 13 opened
2012-03-06 18:11:12 [MgmtSrvr] INFO -- Node 10: Local checkpoint 126 started. Keep GCI = 191298 oldest restorable GCI = 190007
2012-03-06 18:11:59 [MgmtSrvr] INFO -- Node 10: Local checkpoint 126 completed
=============================================================================
And the error log on node 13 looks like:
=============================================================================
Time: Tuesday 6 March 2012 - 17:12:59
Status: Temporary error, restart node
Message: Node declared dead. See error log for details (Arbitration error)
Error: 2315
Error data: We(13) have been declared dead by 11 (via 12) reason: Connection failure(5)
Error object: QMGR (Line: 3657) 0x00000002
Program: ndbd
Pid: 4274
Version: mysql-5.1.51 ndb-7.2.0-beta
Trace: /users1/mysql-cluster/ndb_13_trace.log.9
***EOM***
=============================================================================
And here is the config.ini:
=============================================================================
[NDBD DEFAULT]
NoOfReplicas= 2
ServerPort= 2202
# Data Memory, Index Memory, and String Memory #
DataMemory= 1024M
IndexMemory= 256M
StringMemory= 5
MaxNoOfOrderedIndexes= 1024
MaxNoOfAttributes= 10000
MaxNoOfTables= 2500
MaxNoOfConcurrentOperations= 250000
MaxNoOfConcurrentIndexOperations= 250000
MaxNoOfFiredTriggers= 4000
TransactionBufferMemory= 1M
# Scans and buffering #
MaxNoOfConcurrentScans= 300
MaxNoOfLocalScans= 32
BatchSizePerLocalScan= 64
LongMessageBuffer= 1M
# Controlling Timeouts, Intervals, and Disk Paging #
TimeBetweenWatchDogCheck= 6000
TimeBetweenWatchDogCheckInitial= 6000
StartPartialTimeout= 30000
StartPartitionedTimeout= 60000
StartFailureTimeout= 1000000
HeartbeatIntervalDbDb= 5000
HeartbeatIntervalDbApi= 5000
TimeBetweenLocalCheckpoints= 20
TimeBetweenGlobalCheckpoints= 2000
TransactionInactiveTimeout= 0
TransactionDeadlockDetectionTimeout= 1200
ArbitrationTimeout= 3000
# Buffering and Logging #
UndoIndexBuffer= 2M
UndoDataBuffer= 1M
RedoBuffer= 32M
# Backup Parameters #
BackupDataBufferSize= 2M
BackupLogBufferSize= 2M
BackupMemory= 64M
BackupWriteSize= 32K
BackupMaxWriteSize= 256K
[MGM DEFAULT]
PortNumber= 1186
DataDir= /var/lib/mysql-cluster # Directory for this management node's pidfiles
[NDB_MGMD]
NodeId= 1
ArbitrationRank= 1
HostName= ndb-manager # Hostname or IP address of management node
DataDir= /var/lib/mysql-cluster # Directory for this management node's pidfiles
LogDestination= FILE:filename=/var/log/mysql-cluster/ndb_manager.log, maxsize=500000, maxfiles=4
[NDBD]
NodeId= 10
HostName= ndb-node1 # Hostname or IP address
DataDir= /users1/mysql-cluster # Directory for this data node's datafiles
[NDBD]
NodeId= 11
HostName= ndb-node2 # Hostname or IP address
DataDir= /users1/mysql-cluster # Directory for this data node's datafiles
[NDBD]
NodeId= 12
HostName= ndb-node3 # Hostname or IP address
DataDir= /users1/mysql-cluster # Directory for this data node's datafiles
[NDBD]
NodeId= 13
HostName= ndb-node4 # Hostname or IP address
DataDir= /users1/mysql-cluster # Directory for this data node's datafiles
#
# Note: The following can be MySQLD connections or
# NDB API application connecting to the cluster
#
[API]
NodeId= 20
HostName= ndb-node1
[API]
NodeId= 21
HostName= ndb-node2
[API]
NodeId= 22
HostName= ndb-node3
[API]
NodeId= 23
HostName= ndb-node4
============================================================================
Any ideas on how to fix this?
Thanks.