Hi everybody,
I've got a strange problem, and maybe someone can push me in the right direction. My configuration:
2 data nodes
2 management servers
4 MySQL servers
This is the output of ndb_mgm:
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=11 @192.168.xxx.134 (mysql-5.1.56 ndb-7.1.15, Nodegroup: 0, Master)
id=12 @192.168.xxx.135 (mysql-5.1.56 ndb-7.1.15, Nodegroup: 0)
[ndb_mgmd(MGM)] 2 node(s)
id=1 @192.168.xxx.130 (mysql-5.1.56 ndb-7.1.15)
id=2 @192.168.xxx.131 (mysql-5.1.56 ndb-7.1.15)
[mysqld(API)] 6 node(s)
id=21 @192.168.xxx.132 (mysql-5.1.51 ndb-7.1.9)
id=22 @192.168.xxx.133 (mysql-5.1.51 ndb-7.1.9)
id=23 @192.168.xxx.136 (mysql-5.1.51 ndb-7.1.9)
id=24 @192.168.xxx.137 (mysql-5.1.51 ndb-7.1.9)
id=31 (not connected, accepting connect from 192.168.xxx.134)
id=32 (not connected, accepting connect from 192.168.xxx.135)
ndb_mgm>
The servers are all virtual machines in a two-host VMware ESXi 4.1 environment. This means that one data node, one management server, and two MySQL servers (along with an Apache web server) are deployed on each physical host (and yes, I know that using VMware with MySQL Cluster is not recommended... :-) )
My problem occurred last week. In the cluster there is a table with 66 columns, which at the time held roughly 150,000 rows. I wanted to delete all the rows, so I ran:
mysql> DELETE FROM spenden;
(I can't use TRUNCATE TABLE because I need the AUTO_INCREMENT counter intact, and TRUNCATE would reset it.)
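(In principle I could probably save and restore the counter around a TRUNCATE, roughly like this, where mydb and the value 150001 are just placeholders:
mysql> SELECT AUTO_INCREMENT FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'spenden';
mysql> TRUNCATE TABLE spenden;
mysql> ALTER TABLE spenden AUTO_INCREMENT = 150001; -- value from the SELECT above
But I'd rather avoid the extra step, so I went with a plain DELETE.)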
About 90 seconds later I got an error in the cluster and it crashed. I was able to restart the cluster, and all the data in the table I wanted to delete was there again, so I ran the delete a second time, with the same result: the cluster crashed, but I was able to restart it. I got the following error message in the cluster log:
=======================
ndb_1_cluster.log
=======================
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=1224KB(100%) alloc=1224KB(0%) max=0B apply_epoch=26363618/15 latest_epoch=26363618/15
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=1598KB(100%) alloc=1598KB(0%) max=0B apply_epoch=26363618/16 latest_epoch=26363618/16
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=1029KB(99%) alloc=1029KB(0%) max=0B apply_epoch=26363618/18 latest_epoch=26363618/18
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 21: Event buffer status: used=1159KB(100%) alloc=1159KB(0%) max=0B apply_epoch=26363619/0 latest_epoch=26363619/0
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=378KB(26%) alloc=1415KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/4
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 21: Event buffer status: used=378KB(27%) alloc=1391KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/4
2013-06-06 11:02:45 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=2538KB(79%) alloc=3206KB(0%) max=0B apply_epoch=26363619/1 latest_epoch=26363619/1
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=2038KB(63%) alloc=3186KB(0%) max=0B apply_epoch=26363619/2 latest_epoch=26363619/2
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=287KB(8%) alloc=3333KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/7
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=287KB(8%) alloc=3333KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/8
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=287KB(8%) alloc=3333KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/9
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=287KB(8%) alloc=3333KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/10
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=287KB(8%) alloc=3333KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/11
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/7
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/8
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/9
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/10
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/11
2013-06-06 11:02:46 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=378KB(11%) alloc=3259KB(0%) max=0B apply_epoch=26363619/4 latest_epoch=26363619/12
2013-06-06 11:05:28 [MgmtSrvr] INFO -- Node 21: Event buffer status: used=1880KB(100%) alloc=1880KB(0%) max=0B apply_epoch=26363698/2 latest_epoch=26363698/2
2013-06-06 11:05:28 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=3731KB(100%) alloc=3731KB(0%) max=0B apply_epoch=26363698/2 latest_epoch=26363698/2
2013-06-06 11:05:28 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=1904KB(100%) alloc=1904KB(0%) max=0B apply_epoch=26363698/2 latest_epoch=26363698/2
2013-06-06 11:05:28 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=3809KB(100%) alloc=3809KB(0%) max=0B apply_epoch=26363698/2 latest_epoch=26363698/2
2013-06-06 11:05:29 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=0B(0%) alloc=4052KB(0%) max=0B apply_epoch=26363698/4 latest_epoch=26363698/4
2013-06-06 11:05:29 [MgmtSrvr] INFO -- Node 21: Event buffer status: used=0B(0%) alloc=2202KB(0%) max=0B apply_epoch=26363698/4 latest_epoch=26363698/4
2013-06-06 11:05:29 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=0B(0%) alloc=2225KB(0%) max=0B apply_epoch=26363698/4 latest_epoch=26363698/4
2013-06-06 11:05:29 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=0B(0%) alloc=4131KB(0%) max=0B apply_epoch=26363698/5 latest_epoch=26363698/5
2013-06-06 16:05:29 [MgmtSrvr] WARNING -- Node 11: Transporter to node 12 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:29 [MgmtSrvr] INFO -- Node 21: Event buffer status: used=8427KB(100%) alloc=8427KB(0%) max=0B apply_epoch=26372414/7 latest_epoch=26372414/7
2013-06-06 16:05:29 [MgmtSrvr] INFO -- Node 23: Event buffer status: used=10439KB(100%) alloc=10439KB(0%) max=0B apply_epoch=26372414/7 latest_epoch=26372414/7
2013-06-06 16:05:29 [MgmtSrvr] INFO -- Node 24: Event buffer status: used=10227KB(100%) alloc=10227KB(0%) max=0B apply_epoch=26372414/7 latest_epoch=26372414/7
2013-06-06 16:05:29 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=8575KB(100%) alloc=8575KB(0%) max=0B apply_epoch=26372414/7 latest_epoch=26372414/7
2013-06-06 16:05:30 [MgmtSrvr] WARNING -- Node 11: Transporter to node 12 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:30 [MgmtSrvr] WARNING -- Node 11: Transporter to node 12 reported error 0x16: The send buffer was full, but sleeping for a while solved - Repeated 3 times
2013-06-06 16:05:30 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 12: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 11: Transporter to node 12 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 11: Transporter to node 12 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:31 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:38 [MgmtSrvr] INFO -- Node 11: Out of event buffer: nodefailure will cause event failures
2013-06-06 16:05:38 [MgmtSrvr] WARNING -- Node 12: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:38 [MgmtSrvr] INFO -- Node 12: Out of event buffer: nodefailure will cause event failures
2013-06-06 16:05:38 [MgmtSrvr] WARNING -- Node 12: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:38 [MgmtSrvr] WARNING -- Node 12: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:39 [MgmtSrvr] WARNING -- Node 11: GCP Monitor: GCP_COMMIT lag 7 seconds (max lag: 13)
2013-06-06 16:05:39 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:39 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved - Repeated 2 times
2013-06-06 16:05:39 [MgmtSrvr] WARNING -- Node 11: Transporter to node 24 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:40 [MgmtSrvr] WARNING -- Node 11: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:40 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:40 [MgmtSrvr] WARNING -- Node 11: Transporter to node 24 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:40 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:41 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved - Repeated 2 times
2013-06-06 16:05:41 [MgmtSrvr] WARNING -- Node 11: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:41 [MgmtSrvr] WARNING -- Node 11: Transporter to node 23 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:42 [MgmtSrvr] WARNING -- Node 11: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:45 [MgmtSrvr] WARNING -- Node 11: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved - Repeated 5 times
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 1: Node 11 Disconnected
2013-06-06 16:05:46 [MgmtSrvr] WARNING -- Node 12: Transporter to node 22 reported error 0x16: The send buffer was full, but sleeping for a while solved
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 11: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 12: Node 11 Disconnected
2013-06-06 16:05:46 [MgmtSrvr] INFO -- Node 12: Communication to Node 11 closed
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 12: Network partitioning - arbitration required
2013-06-06 16:05:46 [MgmtSrvr] INFO -- Node 12: President restarts arbitration thread [state=7]
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 12: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2013-06-06 16:05:46 [MgmtSrvr] ALERT -- Node 1: Node 12 Disconnected
2013-06-06 16:05:47 [MgmtSrvr] INFO -- Mgmt server state: nodeid 11 freed, m_reserved_nodes 1, 21, 22, 23 and 24.
2013-06-06 16:16:36 [MgmtSrvr] INFO -- Mgmt server state: nodeid 11 reserved for ip 192.168.xxx.134, m_reserved_nodes 1, 11, 21, 22, 23 and 24.
2013-06-06 16:16:37 [MgmtSrvr] INFO -- Node 1: Node 11 Connected
2013-06-06 16:16:38 [MgmtSrvr] INFO -- Node 11: Node 2 Connected
2013-06-06 16:16:54 [MgmtSrvr] INFO -- Mgmt server state: nodeid 12 reserved for ip 192.168.xxx.135, m_reserved_nodes 1, 11, 12, 21, 22, 23 and 24.
2013-06-06 16:16:55 [MgmtSrvr] INFO -- Node 1: Node 12 Connected
2013-06-06 16:16:55 [MgmtSrvr] INFO -- Node 12: Node 2 Connected
2013-06-06 16:17:02 [MgmtSrvr] INFO -- Node 11: Start initiated (mysql-5.1.56 ndb-7.1.15)
2013-06-06 16:17:04 [MgmtSrvr] INFO -- Node 11: Start phase 0 completed
2013-06-06 16:17:04 [MgmtSrvr] INFO -- Node 11: Communication to Node 12 opened
2013-06-06 16:17:04 [MgmtSrvr] INFO -- Node 11: Waiting 30 sec for nodes 12 to connect, nodes [ all: 11 and 12 connected: 11 no-wait: ]
(...)
=======================
ndb_11_out.log
=======================
2013-03-19 00:37:37 [ndbd] INFO -- timerHandlingLab now: 2693647860 sent: 2693647664 diff: 196
alloc_chunk(39460 16) -
alloc_chunk(39476 16) -
alloc_chunk(39492 16) -
alloc_chunk(39508 16) -
alloc_chunk(50530 16) -
alloc_chunk(50546 16) -
alloc_chunk(50562 16) -
alloc_chunk(50578 16) -
alloc_chunk(50594 16) -
alloc_chunk(50610 16) -
alloc_chunk(50626 16) -
alloc_chunk(50642 16) -
alloc_chunk(50658 16) -
alloc_chunk(50674 16) -
alloc_chunk(50690 16) -
alloc_chunk(50706 16) -
2013-06-06 16:05:31 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2013-06-06 16:05:31 [ndbd] INFO -- Watchdog: User time: 1465347 System time: 4055724
2013-06-06 16:05:31 [ndbd] INFO -- timerHandlingLab now: 7239393501 sent: 7239393323 diff: 178
alloc_chunk(50722 16) -
alloc_chunk(50738 16) -
alloc_chunk(50754 16) -
alloc_chunk(50770 16) -
alloc_chunk(50786 16) -
alloc_chunk(50802 16) -
alloc_chunk(50818 16) -
alloc_chunk(50834 16) -
alloc_chunk(50850 16) -
alloc_chunk(50866 16) -
alloc_chunk(50882 16) -
alloc_chunk(50898 16) -
alloc_chunk(50914 16) -
alloc_chunk(50930 16) -
alloc_chunk(50946 16) -
alloc_chunk(50962 16) -
alloc_chunk(50978 16) -
alloc_chunk(50994 16) -
alloc_chunk(51010 16) -
alloc_chunk(51026 16) -
alloc_chunk(51042 16) -
alloc_chunk(51058 16) -
alloc_chunk(51074 16) -
alloc_chunk(51090 16) -
alloc_chunk(51106 16) -
alloc_chunk(51122 16) -
alloc_chunk(51138 16) -
alloc_chunk(51154 16) -
alloc_chunk(51170 16) -
alloc_chunk(51186 16) -
alloc_chunk(51202 16) -
alloc_chunk(51218 16) -
alloc_chunk(51234 16) -
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 161 m_gcp_save.m_max_lag: 1310
m_micro_gcp.m_counter: 131 m_micro_gcp.m_max_lag: 131
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000800]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=1 0000000000001000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 11 f6000b reason: 0 nextWord: 0
Detected GCP stop(2)...sending kill to [SignalCounter: m_count=1 0000000000000800]
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1310
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 131
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000800]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=1 0000000000001000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 11 f6000b reason: 0 nextWord: 0
file[0] status: 2 type: 1 reqStatus: 0 file1: 2 1 0
c_nodeStartMaster.blockGcp: 0 4294967040
m_gcp_save.m_counter: 0 m_gcp_save.m_max_lag: 1310
m_micro_gcp.m_counter: 0 m_micro_gcp.m_max_lag: 131
m_gcp_save.m_state: 0
m_gcp_save.m_master.m_state: 0
m_micro_gcp.m_state: 2
m_micro_gcp.m_master.m_state: 2
c_COPY_GCIREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_COPY_TABREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_CREATE_FRAGREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_DIH_SWITCH_REPLICA_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_EMPTY_LCP_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_COMMIT_Counter = [SignalCounter: m_count=1 0000000000000800]
c_GCP_PREPARE_Counter = [SignalCounter: m_count=0 0000000000000000]
c_GCP_SAVEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_SUB_GCP_COMPLETE_REP_Counter = [SignalCounter: m_count=0 0000000000000000]
c_INCL_NODEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_GCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_MASTER_LCPREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_INFOREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_START_RECREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_STOP_ME_REQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TC_CLOPSIZEREQ_Counter = [SignalCounter: m_count=0 0000000000000000]
c_TCGETOPSIZEREQ_Counter = [SignalCounter: m_count=1 0000000000001000]
m_copyReason: 0 m_waiting: 0 0
c_copyGCISlave: sender{Data, Ref} 11 f6000b reason: 0 nextWord: 0
2013-06-06 16:05:45 [ndbd] INFO -- Node 11 killed this node because GCP stop was detected
2013-06-06 16:05:45 [ndbd] INFO -- NDBCNTR (Line: 276) 0x00000002
2013-06-06 16:05:45 [ndbd] INFO -- Error handler shutting down system
2013-06-06 16:05:45 [ndbd] INFO -- Error handler shutdown completed - exiting
2013-06-06 16:05:46 [ndbd] ALERT -- Node 11: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2013-06-06 16:16:37 [ndbd] INFO -- Angel pid: 26971 started child: 26972
2013-06-06 16:16:37 [ndbd] INFO -- Configuration fetched from '192.168.xxx.130:1186', generation: 2
NDBMT: non-mt
2013-06-06 16:16:37 [ndbd] INFO -- NDB Cluster -- DB node 11
2013-06-06 16:16:37 [ndbd] INFO -- mysql-5.1.56 ndb-7.1.15 --
2013-06-06 16:16:37 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2013-06-06 16:16:37 [ndbd] INFO -- numa_set_interleave_mask(numa_all_nodes) : OK
2013-06-06 16:16:37 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 2062Mb initial: 2082Mb
Adding 493Mb to ZONE_LO (1,15759)
Instantiating DBSPJ instanceNo=0
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Adding 1591Mb to ZONE_LO (15760,50888)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
WOPool::init(61, 9)
(...)
===============================================================
As said before, I could reproduce the issue on the production system. Because that server is in use, I made a backup of the data to reproduce the issue on a local copy of the server. Since they are virtual machines, the test environment is nearly 100% identical to the prod system, with the difference that I use VMware Player to run all the virtual servers. Surprisingly, I was not able to reproduce the problem in my test environment: the query took around 2:30 min but completed as expected.
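As a workaround I assume I could delete in smaller batches so that each transaction stays small, repeating something like the following until it affects 0 rows (the batch size of 10000 is an arbitrary guess on my part):
mysql> DELETE FROM spenden LIMIT 10000;
But I'd still like to understand the root cause before relying on that.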
Does anybody have an idea what is happening on my prod system? Why does the send buffer overflow? How can I adjust the size of that buffer? (I found no configuration entry for this.)
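If it is the TCP transporter send buffer that overflows, I assume the SendBufferMemory parameter in config.ini would be the knob, with something like the following in the [tcp default] section on the management servers, followed by a rolling restart of all nodes. The 8M value is just a guess; I have not tried this:
[tcp default]
SendBufferMemory=8M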
Thanks in advance!
Malte