Channel: MySQL Forums - NDB clusters

Data node forced to shut down on restart, caused by error 2341

Hello everyone,

I have been running into this problem since yesterday, and I just can't seem to get the node back into the cluster.

Config:

I set up a total of 4 hosts running 2x data nodes, 2x management servers, and 1 SQL node (for now).

Software:
I set up MySQL NDB Cluster mysql-5.7.20 ndb-7.6.4 on my Ubuntu 16.04 LTS 64-bit boxes. NUMA is turned off in GRUB by setting numa=off on the kernel command line.
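For completeness, this is roughly how the NUMA flag is set (file path per a standard GRUB2 layout on Ubuntu 16.04; adjust if your grub config lives elsewhere):

```shell
# /etc/default/grub -- standard GRUB2 location on Ubuntu 16.04
# (adjust if your setup uses a different grub config file)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash numa=off"

# regenerate grub.cfg and reboot for the flag to take effect:
#   sudo update-grub && sudo reboot
```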


Server Setup:
2x DELL POWEREDGE R820, 4x E5-4650 2.7GHz 8C, 256GB RAM
FusionIO SX300 3.0TB SSD


What I am doing:

I set up the servers and the database, then shut down DN#1 with ndb_mgm -e "1 STOP". The node shut down just fine, and DN#2 took over as master. However, when I try to start DN#1 again, I get the error below.


Here are my logs:

ndb_11_error.log:

Time: Friday 30 March 2018 - 10:13:46
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 16139) 0x00000002 Check c_copy_fragment_in_progress failed
Program: ndbmtd
Pid: 5991 thr: 30
Version: mysql-5.7.20 ndb-7.6.4
Trace file name: ndb_11_trace.log.25_t30
Trace file path: /db/mysql-cluster/ndb_11_trace.log.25 [t1..t54]
***EOM***



ndb_11_out.log:

2018-03-30 10:23:24 [ndbd] WARNING -- Ndb kernel thread 18 is stuck in: Print Job Buffers at crash elapsed=200
2018-03-30 10:23:24 [ndbd] INFO -- Watchdog: User time: 15294 System time: 10100
2018-03-30 10:23:24 [ndbd] WARNING -- Ndb kernel thread 18 is stuck in: Print Job Buffers at crash elapsed=100
2018-03-30 10:23:24 [ndbd] INFO -- Watchdog: User time: 15294 System time: 10120
2018-03-30 10:23:24 [ndbd] WARNING -- Ndb kernel thread 18 is stuck in: Print Job Buffers at crash elapsed=100
2018-03-30 10:23:24 [ndbd] INFO -- Watchdog: User time: 15294 System time: 10140
2018-03-30 10:23:25 [ndbd] WARNING -- Ndb kernel thread 18 is stuck in: Print Job Buffers at crash elapsed=200
2018-03-30 10:23:25 [ndbd] INFO -- Watchdog: User time: 15294 System time: 10150
2018-03-30 10:23:26 [ndbd] ALERT -- Node 11: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.


ndb_12_out.log:


2018-03-30 10:11:19 [ndbd] INFO -- Master takeover started from 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 1: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 6: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 6: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 6: Inserting failed node 11 into takeover queue, length 1
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 5: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 5: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 6: GCP completion 203/10 waiting for node failure handling (1) to complete. Seizing record for GCP.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 8: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 8: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 2: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 10: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 10: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 13: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 13: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 12: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 12: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 12: GCP completion 203/10 waiting for node failure handling (1) to complete. Seizing record for GCP.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 12: Inserting failed node 11 into takeover queue, length 1
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 15: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 15: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 15: GCP completion 203/10 waiting for node failure handling (1) to complete. Seizing record for GCP.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 16: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 16: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 16: GCP completion 203/10 waiting for node failure handling (1) to complete. Seizing record for GCP.
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 16: Inserting failed node 11 into takeover queue, length 1
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 4: Started failure handling for node 11
2018-03-30 10:11:19 [ndbd] INFO -- DBTC 4: Step NF_BLOCK_HANDLE completed, failure handling for node 11 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.



My config.ini:

root@server103:/var/lib/mysql-cluster# cat /var/lib/mysql-cluster/config.ini
[NDB_MGMD DEFAULT]
#DataDir=/var/lib/mysql-cluster # Directory for the log files
DataDir=/db/mysql-cluster # Directory for the log files
#config-cache=0


[NDBD DEFAULT]
#Redundancy:
NoOfReplicas=2

# MULTI-THREADING OPTIONS
#Our data node config for 60 CPUs: #ldm=32 #tc=16 #send=4 #recv=4 #main=1 #io=1 #watchdog=1 #rep=1
ThreadConfig = ldm={count=32,cpubind=4-35,thread_prio=10,realtime=0,spintime=500},tc={count=16, cpubind=36-51,thread_prio=10,realtime=0,spintime=100},send={count=4,cpubind=52-55},recv={count=4,cpubind=56-59},main={cpubind=60},io={cpubind=61},watchdog={cpubind=62,realtime=0},rep={cpubind=63},idxbld={count=1,cpubind=3}


#LockExecuteThreadToCPU=1

#IMPORTANT: This does not need to be set if you're using ThreadConfig
#If you are planning to use MySQL Cluster 7.0's multithreaded version 'ndbmtd' then you need to add
#'MaxNoOfExecutionThreads' to the [NDBD DEFAULT] section in the cluster configuration.
#MaxNoOfExecutionThreads=56

# On systems with multiple CPUs, these parameters can be used to lock NDBCLUSTER
# threads to specific CPUs. Only applicable when ThreadConfig isn't used.
#LockMaintThreadsToCPU=0

#Listing 4-3. LogDestination Using FILE
#LogDestination = FILE:filename=ndb_{node_id}_cluster.log,maxsize=1024000,maxfiles=6

#InitialLogFileGroup = name=lg_1; undo_buffer_size=64M; undo1.log=150M; undo2.log=200M
#InitialTablespace = name=ts_1; extent_size=1M; data1.dat=1G; data2.dat=2G

#Memory Data Storage Options. The following options are related to memory sizing. The sizing strategy is not difficult:
#allocate as much memory as the system has without causing swapping. Note that objects for schema and transaction
#processing also consume a certain amount of memory, so it is important not to allocate too much memory to the
#buffers in this section. Leave a margin for them.
#DataMemory (memory for records and ordered indexes)
DataMemory=128G

#IndexMemory (memory for Primary key hash index and unique hash index)
#Usually between 1/6 and 1/8 of the DataMemory is enough, but it depends on the
#number of unique hash indexes (UNIQUE in table def)
#IndexMemory=64G

# Avoid Swapping:
# On Linux and Solaris systems, setting this parameter locks data node
# processes into memory. Doing so prevents them from swapping to disk,
# which can severely degrade cluster performance.
LockPagesInMainMemory=1

# Schema Object Options. On MySQL NDB Cluster, metadata of schema objects is stored in fixed-size arrays that are allocated at
# data node startup. The maximum allowable number of various objects is configured by the following options.
# It is important to allocate the required size for each schema object beforehand. Schema object design is covered in Chapter 18.

# Table related things
# MaxNoOfLocalScans=64
#MaxNoOfTables=4096
#MaxNoOfAttributes=24756
MaxNoOfOrderedIndexes=2048
#MaxNoOfUniqueHashIndexes=512
#MaxNoOfTriggers=14336
#StringMemory=25


# DATA NODE CONFIGURATION:

#RAM from the shared global memory is used for the UNDO_BUFFER when you create the log file group.
#In the configuration generated by severalnines.com/config you have to uncomment
#SharedGlobalMemory in mysqlcluster-XYZ/cluster/config/config.ini before you start the cluster.
SharedGlobalMemory=8G

#If you rely heavily on disk data, we recommend setting this as high as possible.
#In the configuration generated by severalnines.com/config you have to uncomment
#DiskPageBufferMemory in mysqlcluster-63/cluster/config/config.ini before you start the cluster.
#The DiskPageBufferMemory should be set to:
#DiskPageBufferMemory = TOTAL_RAM - 1GB (OS) - 1200MB (approx size used for buffers etc. in the data nodes) - DataMemory - IndexMemory
#Expect some trial and error before getting this right.
DiskPageBufferMemory=8G



# TRANSACTION OPTIONS:
# Since MySQL NDB Cluster is a real-time database system, it doesn’t allocate memory on the fly.
# Instead, it allocates memory at startup. It includes various types of buffers used by transactions and data operations.

# Operation records
# MaxNoOfConcurrentOperations 100000 (min) means that you can load any mysqldump file into cluster.
MaxNoOfConcurrentOperations=250000
MaxNoOfConcurrentTransactions=16384
MaxNoOfConcurrentScans=500
#MaxNoOfLocalScans=4 * MaxNoOfConcurrentScans * [# data nodes] + 2
MaxNoOfLocalScans=4000
MaxParallelScansPerFragment=512
TransactionDeadlockDetectionTimeout=5000

# Transaction Temporary Storage #
MaxNoOfConcurrentIndexOperations=8192
MaxNoOfFiredTriggers=4000

#Data Files Storage
#FileSystemPathDD - MySQL Cluster Disk Data data files and undo log files are placed in the indicated directory.
#FileSystemPathDataFiles - MySQL Cluster Disk Data data files are placed in the indicated directory.
#FileSystemPathUndoFiles - MySQL Cluster Disk Data undo log files are placed in the indicated directory
#DataDir=/usr/local/mysql/data # Remote directory for the data files
#FileSystemPathUndoFiles=/storage/data/mysqlcluster/
#FileSystemPathDataFiles=/storage/data/mysqlcluster/
#BackupDataDir=/storage/data/mysqlcluster/backup/
DataDir=/db/mysql-cluster

#Setting these to system default
TimeBetweenWatchDogCheck= 60000
#ArbitrationTimeout=5000

#Bypass FS cache (you should test if this works for you or not)
#Reports indicate that odirect=1 can cause io errors (os err code 5) on some systems. You must test.
# When this option is true, it causes write operations for checkpoints to be
# done in O_DIRECT mode, which means direct I/O. As the name suggests,
# direct I/O is an I/O operation done directly without routing file system cache.
# It may save certain CPU resources. It is best to set this option to true on
# Linux systems using kernel 2.6 or later.
ODirect=1

#Checkpointing...
#DiskCheckpointSpeed=10M
#TimeBetweenGlobalCheckpoints=1000
#the default value for TimeBetweenLocalCheckpoints is very good
#TimeBetweenLocalCheckpoints=20

#This option determines the write speed for checkpoints, i.e. the amount of
#data written per second during a local checkpoint as part of a restart
#operation. It is deprecated in 7.4.1 and removed in the 7.5 series; use
#MaxDiskWriteSpeedOtherNodeRestart and MaxDiskWriteSpeedOwnRestart instead
#on 7.4.1 or newer. On 7.4 releases from 7.4.1 onward, this option can
#still be set, but it has no effect.
#DiskCheckpointSpeedInRestart=100M


### Params for LCP
#MinDiskWriteSpeed=10M
#MaxDiskWriteSpeed=20M
#MaxDiskWriteSpeedOtherNodeRestart=500M
#MaxDiskWriteSpeedOwnRestart=200M
#TimeBetweenLocalCheckpoints=20
#TimeBetweenGlobalCheckpoints=2000
#TimeBetweenEpochs=100

#MemReportFrequency=30
#BackupReportFrequency=10

### Params for increasing Disk throughput
#BackupMaxWriteSize=1M
#BackupDataBufferSize=16M
#BackupLogBufferSize=4M


### Watchdog
#TimeBetweenWatchdogCheckInitial=60000

### TransactionInactiveTimeout - should be enabled in Production
TransactionInactiveTimeout=60000

### CGE 6.3 - REALTIME EXTENSIONS
# Setting these parameters allows you to take advantage of real-time scheduling
# of NDB threads to achieve increased throughput when using ndbd. They
# are not needed when using ndbmtd; in particular, you should not set
# RealTimeScheduler for ndbmtd data nodes.
RealTimeScheduler=0
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=400
#SchedulerExecutionTimer=100


#A RedoBuffer of 32M should let you restore/provision quite a lot of data in parallel.
#If you still have problems ("out of redobuffer"), then your disks are probably too slow and
#increasing this will not help; it only postpones the inevitable.
RedoBuffer=64M

### New 7.1.10 redo logging parameters
RedoOverCommitCounter=3
RedoOverCommitLimit=20

### Params for REDO LOG

# This is only useful when ThreadConfig isn't configured
NoOfFragmentLogParts = 32

#Size of each redo log file (the redo log as a whole is made up of
#NoOfFragmentLogParts sets of NoOfFragmentLogFiles such files).
#A bigger fragment log file size than the default 16M works better with a
#high write load and is strongly recommended!
# This option specifies size of each redo log file. See NoOfFragmentLogFiles
# for more information. If you need more redo log space, consider increasing
# this option first, because each log file needs a memory buffer.
#FragmentLogFileSize = 16M
FragmentLogFileSize=64M
InitFragmentLogFiles=SPARSE

# Set NoOfFragmentLogFiles to 6 x DataMemory [in MB] / (4 x FragmentLogFileSize [in MB])
# e.g., with DataMemory=2048M and FragmentLogFileSize=256M: NoOfFragmentLogFiles = 6*2048/1024 = 12
# The "6xDataMemory" is a good heuristic and is STRONGLY recommended.
# ---
# This option specifies the number of redo log files. The redo log is written
# in a circular fashion. See Chapter 2 for more information about the redo log.
# The total file size of the redo log is calculated using the following formula:
# NoOfFragmentLogFiles * NoOfFragmentLogParts * FragmentLogFileSize
# The default values for these options are 16, 4, and 16M. 16 * 4 * 16M = 1G is the default for total size of the redo log.
#NoOfFragmentLogFiles = <4-6> x DataMemory [in MB] / (4 x FragmentLogFileSize [in MB])
#NoOfFragmentLogParts = <<number of LDM threads>>
NoOfFragmentLogFiles=300

TransactionBufferMemory=8M
#TimeBetweenGlobalCheckpoints=1000
#TimeBetweenEpochs=100
#TimeBetweenEpochsTimeout=0

### Heartbeating
#HeartbeatIntervalDbDb=15000
#HeartbeatIntervalDbApi=15000

### Params for setting logging
MemReportFrequency=30
BackupReportFrequency=10
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for BACKUP
#BackupMaxWriteSize=1M
#BackupDataBufferSize=24M
#BackupLogBufferSize=16M
#BackupMemory=40M


# If you use MySQL Cluster 6.3 (CGE 6.3) and are tight on disk space, e.g. ATCA,
# you should also lock CPUs to a particular core.
# When this option is true, it causes LCP to be stored in compressed format.
# It saves certain disk space, but consumes more CPU time upon LCP and restart.
# It is better not to compress LCP on a busy system. CPU resources should be
# reserved for transaction processing. It is not recommended to set this option
# different per data node. Available resources should be the same among all data
# nodes to avoid bottlenecks.
#CompressedLCP=1
#CompressedBackup=1

#Realtime extensions (only in MySQL Cluster 6.3 (CGE 6.3); read up on how to use these)
#LockMaintThreadsToCPU=[cpuid]
#LockExecuteThreadToCPU=[cpuid]

LcpScanProgressTimeout=300
LongMessageBuffer=512MB


[tcp default]
SendBufferMemory=2M
ReceiveBufferMemory=2M


# Management node 1
[NDB_MGMD]
NodeId=1
ArbitrationRank=1
HostName=192.168.1.205 # Hostname of the manager
LogDestination=FILE:filename=ndb_1_cluster.log,maxsize=10000000,maxfiles=10


# Management node 2 for redundancy
[NDB_MGMD]
NodeId=2
ArbitrationRank=2
HostName=192.168.1.207 # Hostname of the manager
LogDestination=FILE:filename=ndb_2_cluster.log,maxsize=10000000,maxfiles=10


[NDBD]
NodeId=11
NodeGroup=0
HostName=192.168.1.211 # Hostname of the first data node


[NDBD]
NodeId=12
NodeGroup=0
HostName=192.168.1.213 # Hostname of the second data node


#[NDBD]
#NodeId=13
#HostName=192.168.1.215 # Hostname of the third data node


#[NDBD]
#NodeId=14
#HostName=192.168.1.217 # Hostname of the fourth data node


[MYSQLD]
NodeId = 51
HostName = 192.168.1.95

[MYSQLD]
NodeId = 52
HostName = 192.168.1.205

[MYSQLD]
NodeId = 53
HostName = 192.168.1.207

[MYSQLD]
NodeId = 54
HostName = 192.168.1.221


There are about 65 tables in total, all of them disk storage, though they are currently all empty. I created a master log file group and about 16 tablespaces to handle all of these tables.
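For context, the log file group and tablespaces were created along these lines (names and sizes follow the lg_1/ts_1 examples commented out in my config above; the exact statements I ran may differ slightly):

```sql
-- One undo log file group shared by all tablespaces
-- (sizes per the lg_1 example in the config comments)
CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo1.log'
    INITIAL_SIZE 150M
    UNDO_BUFFER_SIZE 64M
    ENGINE NDBCLUSTER;

-- One of the ~16 tablespaces; each disk-data table is then
-- assigned to a tablespace at CREATE TABLE time
CREATE TABLESPACE ts_1
    ADD DATAFILE 'data1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 1G
    ENGINE NDBCLUSTER;

-- Example disk-storage table (hypothetical schema)
CREATE TABLE example_t (
    id INT NOT NULL PRIMARY KEY,
    payload VARCHAR(255)
) TABLESPACE ts_1 STORAGE DISK ENGINE NDBCLUSTER;
```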

Please advise what I may be doing wrong, and which config options I should change to get this working.
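As a sanity check on the sizing formulas quoted in the comments of my config, here is a small Python sketch plugging in the values from the config above (the 1 GB OS and 1200 MB data-node buffer headroom figures are just the rough numbers from the severalnines comment, not exact measurements):

```python
# Redo log footprint, per the formula in the config comments:
#   NoOfFragmentLogFiles * NoOfFragmentLogParts * FragmentLogFileSize
no_of_fragment_log_files = 300
no_of_fragment_log_parts = 32
fragment_log_file_size_mb = 64

redo_total_mb = (no_of_fragment_log_files
                 * no_of_fragment_log_parts
                 * fragment_log_file_size_mb)
print(f"Total redo log size: {redo_total_mb / 1024:.0f} GB")  # 600 GB

# DiskPageBufferMemory headroom, per the severalnines comment:
#   TOTAL_RAM - 1GB (OS) - 1200MB (node buffers) - DataMemory - IndexMemory
total_ram_mb = 256 * 1024
data_memory_mb = 128 * 1024
index_memory_mb = 0  # IndexMemory is commented out in my config

headroom_mb = total_ram_mb - 1024 - 1200 - data_memory_mb - index_memory_mb
print(f"Headroom available for DiskPageBufferMemory: {headroom_mb / 1024:.1f} GB")
```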

Thanks so much!
Basant
