Channel: MySQL Forums - NDB clusters

NDB data node failed and sent another data node into a crash loop (2 replies)

I have an NDB cluster running 3 data nodes and 3 management nodes.

It appears that an error occurred on one data node, which caused it to restart:

Node 19:
2022-01-15 04:20:46 [ndbd] INFO -- findNeighbours from: 2905 old (left: 17 right: 17) new (17 18)
2022-01-15 04:20:47 [ndbd] INFO -- NR Status: node=18,OLD=Node failure handling complete,NEW=All nodes permitted us
2022-01-15 04:20:47 [ndbd] INFO -- Switch to 17 multi trp for node 18
2022-01-15 04:21:24 [ndbd] INFO -- NR Status: node=18,OLD=All nodes permitted us,NEW=Include node in LCP/GCP protocols
2022-01-15 04:21:24 [ndbd] INFO -- NR Status: node=18,OLD=Include node in LCP/GCP protocols,NEW=Synchronize start node with live nodes
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::execCOPY_FRAGREQ(Signal*)+0xa29) [0x65f949]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x230) [0x8a11e0]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7f7b742a3ea5]
/lib64/libc.so.6(clone+0x6d) [0x7f7b72ccb96d]
2022-01-15 04:21:25 [ndbd] INFO -- /var/lib/pb2/sb_1-2918142-1619218179.52/rpm/BUILD/mysql-cluster-com-8.0.25/mysql-cluster-com-8.0.25/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2022-01-15 04:21:25 [ndbd] INFO -- DBLQH (Line: 19166) 0x00000002 Check getFragmentrec(fragId) failed
2022-01-15 04:21:25 [ndbd] INFO -- Error handler shutting down system
2022-01-15 04:21:26 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2022-01-15 04:21:27 [ndbd] INFO -- Angel pid: 34452 started child: 34453


A second data node failed at almost the same moment and then went into a crash loop:

Node 18:
2022-01-15 04:21:25 [ndbd] INFO -- LDM(8): Completed copy of fragment T175F3. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in state: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Qmgr::execDISCONNECT_REP(Signal*)+0x21f) [0x79be7f]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc05]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7efd8a3e9ea5]
/lib64/libc.so.6(clone+0x6d) [0x7efd88e1196d]
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in state: 0
2022-01-15 04:21:26 [ndbd] INFO -- Node 19 disconnected in phase: 3
2022-01-15 04:21:26 [ndbd] INFO -- QMGR (Line: 4245) 0x00000002
2022-01-15 04:21:26 [ndbd] INFO -- Error handler shutting down system
2022-01-15 04:21:26 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2022-01-15 04:21:27 [ndbd] INFO -- Angel pid: 14167 started child: 14168

It appears that the two nodes then got caught in a cycle where one would crash and then the other would crash:

Node 19:
2022-01-15 13:33:22 [ndbd] INFO -- Node 18 disconnected in state: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Qmgr::failReportLab(Signal*, unsigned short, FailRep::FailCause, unsigned short)+0x96d) [0x7956ad]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc05]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7ff4b634bea5]
/lib64/libc.so.6(clone+0x6d) [0x7ff4b4d7396d]
2022-01-15 13:33:22 [ndbd] INFO -- Node 18 failed
2022-01-15 13:33:22 [ndbd] INFO -- QMGR (Line: 5039) 0x00000002
2022-01-15 13:33:22 [ndbd] INFO -- Error handler shutting down system
2022-01-15 13:33:22 [ndbd] ALERT -- Node 19: Forced node shutdown completed. Occurred during startphase 2. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

Node 18:
22-01-15 13:33:02 [ndbd] INFO -- (16), tab(6,3), lcpNo: 65535, m_max_restorable_gci: 2429, crestartNewestGci: 2430, srStartGci: 0
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x8c676d]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x2f) [0x81f59f]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x887ca9]
/usr/sbin/ndbmtd(Dblqh::send_restore_lcp(Signal*)+0x9d9) [0x624b59]
/usr/sbin/ndbmtd() [0x89604c]
/usr/sbin/ndbmtd() [0x89bc98]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4c9) [0x8a1479]
/usr/sbin/ndbmtd() [0x869366]
/lib64/libpthread.so.0(+0x7ea5) [0x7fe8c2222ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fe8c0c4a96d]
2022-01-15 13:33:02 [ndbd] INFO -- /var/lib/pb2/sb_1-2918142-1619218179.52/rpm/BUILD/mysql-cluster-com-8.0.25/mysql-cluster-com-8.0.25/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2022-01-15 13:33:02 [ndbd] INFO -- DBLQH (Line: 27173) 0x00000002 Check c_local_sysfile.m_max_restorable_gci >= crestartNewestGci failed
2022-01-15 13:33:02 [ndbd] INFO -- Error handler shutting down system
2022-01-15 13:33:02 [ndbd] ALERT -- Node 18: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
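
One way to check the crash ordering described above is to pull the "Forced node shutdown completed" ALERT lines out of each node's log and sort them by timestamp. The following is only a minimal Python sketch: the regular expression is based solely on the ALERT lines quoted in this post (other NDB versions may format them differently), and the names `Shutdown` and `forced_shutdowns` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern assumed from the ALERT lines quoted in this post.
ALERT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Occurred during startphase (?P<phase>\d+)\.)?"
    r" Caused by error (?P<error>\d+):"
)


class Shutdown(NamedTuple):
    timestamp: str             # as printed in the log, e.g. "2022-01-15 04:21:26"
    node: int                  # data node id reported in the ALERT line
    startphase: Optional[int]  # None if the node was not in a start phase
    error: int                 # NDB error code, e.g. 2341 or 2308


def forced_shutdowns(lines: Iterable[str]) -> List[Shutdown]:
    """Return forced-shutdown events ordered by timestamp, then node id."""
    events = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # INFO lines and stack frames are ignored
        phase = match.group("phase")
        events.append(
            Shutdown(
                timestamp=match.group("ts"),
                node=int(match.group("node")),
                startphase=int(phase) if phase is not None else None,
                error=int(match.group("error")),
            )
        )
    # Sort on explicit keys so a missing startphase never gets compared.
    return sorted(events, key=lambda e: (e.timestamp, e.node))
```

Note that the log timestamps only have one-second resolution, so two shutdowns logged in the same second (as with nodes 18 and 19 at 04:21:26) cannot be ordered from the ALERT lines alone.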

Eventually node 19 recovered on its own and started properly.
Node 18, however, stayed stuck in a crash loop and was eventually taken offline.

Can you advise on how to recover from this state?
