http://www.mysqlkorea.co.kr
ndbd start fails, with or without --initial
Author: 사악쫑이   Date: 12-06-13 08:34   Views: 6496
Hello.

ndbcluster version: 7.2.5
ndb_mgmd × 2
mysqld × 2
ndbd × 2
(tablespace: ten 10 GB files; undo log: two 1 GB files)

A few days ago both ndbd nodes went down abnormally, so I started them back up one at a time. The first came up fine, but the second dies with the error below while going through start phase 5.

I also tried starting it with --initial, but the result is the same.
 
==================================================
ndb_mgm> 1 status
Node 1: starting (Last completed phase 4) (mysql-5.5.20 ndb-7.2.5)
==================================================
Time: Wednesday 13 June 2012 - 07:02:07
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Killed by node 1 as copyfrag failed, error: 1501
Error object: NDBCNTR (Line: 277) 0x00000002
Program: ndbd
Pid: 1745
Version: mysql-5.5.20 ndb-7.2.5
Trace: /var/mysql-cluster/ndb_data/ndb_1_trace.log.16 [t1..t1]
***EOM***
==================================================
 
 
Searching around, I found a bug report suggesting that a checkpoint problem can keep a node from starting, so I ran `2 dump 7098 0 0` and tried again, but the result is the same.

Is there any way to recover this node?
Any advice would be appreciated.
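For reference, the restart sequence being described can be sketched as follows (a sketch only: the management hostnames are placeholders, and note that --initial on a single data node wipes only that node's local file system, which is then rebuilt over the network from its node-group peer):

```shell
# On the failed data node host (node 1), after stopping any running ndbd:
ndbd --ndb-connectstring=mgm-host-1,mgm-host-2

# If the normal node restart keeps failing in the same way, an initial
# node restart discards this node's data and recopies it from node 2:
ndbd --initial --ndb-connectstring=mgm-host-1,mgm-host-2

# From the management client, watch the start phases advance:
ndb_mgm -e "1 status"
```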
 

 
royster
I don't think this is a bug... (just a hunch ^^)

Please post the sections of the following two logs covering the time of the failure:
1. The hostname.err file on the API (SQL) node, around the failure time.
2. The ndb_<mgm-node-id>_cluster.log file on the MGM node, around the failure time.
사악쫑이
Thank you for the reply.
Here are the log sections you asked about.

=======================================================================
At the time of the abnormal shutdown.
=======================================================================
server.err
=======================================================================
120611 10:23:38 [ERROR] Got error 489 when reading table './serverqna/server_com_category'
120611 10:30:29 [ERROR] /usr/local/mysql/bin/mysqld: Sort aborted: Got temporary error 489 'Too many active scans' from NDBCLUSTER
120611 10:33:47 [ERROR] Got error 489 when reading table './server/board_category'
120611 10:50:06 [ERROR] /usr/local/mysql/bin/mysqld: Lock wait timeout exceeded; try restarting transaction
120611 10:50:06 [ERROR] /usr/local/mysql/bin/mysqld: Sort aborted: Lock wait timeout exceeded; try restarting transaction
120611 10:53:03 [Note] NDB Binlog: RENAME Event: REPL$server/board_qna_main
120611 10:53:03 [Note] NDB Binlog: logging ./server/board_qna_main (UPDATED,USE_WRITE)
120611 10:53:04 [Note] NDB Binlog: drop table ./server/#sql2-759e-1d22.
120611 11:02:03 [ERROR] Got error 489 when reading table './server/board_category'
120611 11:02:08 [ERROR] Got error 489 when reading table './server/board_category'
120611 11:02:12 [ERROR] Got error 489 when reading table './server/board_category'
120611 11:02:17 [ERROR] Got error 489 when reading table './server/board_category'
120611 11:02:24 [ERROR] Got error 489 when reading table './server/board_category'
120611 11:10:12 [Note] NDB Binlog: Node: 1, down, Subscriber bitmask 00
120611 11:10:12 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
120611 11:10:12 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 1737199/0.
120611 11:10:12 [Note] NDB Binlog: ndb tables initially read only on reconnect.
120611 11:10:12 [Note] NDB Binlog: cluster failure for ./server/board_qna_main at epoch 1737199/0.
120611 11:10:12 [Note] NDB Binlog: cluster failure for ./serverstatistics/ecb_sts_user_rank at epoch 1737199/0.
=======================================================================
server.cluster.log
=======================================================================
2012-06-11 10:55:45 [MgmtSrvr] INFO    -- Node 1: Local checkpoint 1590 started. Keep GCI = 1736092 oldest restorable GCI = 1736346
2012-06-11 11:01:26 [MgmtSrvr] INFO    -- Node 1: Local checkpoint 1590 completed
2012-06-11 11:10:12 [MgmtSrvr] ALERT    -- Node 51: Node 1 Disconnected
2012-06-11 11:10:12 [MgmtSrvr] ALERT    -- Node 1: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-06-11 11:10:12 [MgmtSrvr] ALERT    -- Node 2: Node 1 Disconnected
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: Communication to Node 1 closed
2012-06-11 11:10:12 [MgmtSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: President restarts arbitration thread [state=7]
2012-06-11 11:10:12 [MgmtSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 51
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: GCP Take over started
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: Node 2 taking over as DICT master
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: GCP Monitor: unlimited lags allowed
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: GCP Take over completed
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: kk: 1737198/2 0 0
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: Pending schema transaction 1503 will be rolled back
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: LCP Take over started
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: ParticipatingDIH = 0000000000000000
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: ParticipatingLQH = 0000000000000000
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_From_Master_Received = 1
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: LCP Take over completed (state = 4)
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: ParticipatingDIH = 0000000000000000
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: ParticipatingLQH = 0000000000000000
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2012-06-11 11:10:12 [MgmtSrvr] INFO    -- Node 2: m_LCP_COMPLETE_REP_From_Master_Received = 1
2012-06-11 11:10:13 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-06-11 11:10:13 [MgmtSrvr] ALERT    -- Node 51: Node 2 Disconnected
2012-06-11 11:34:50 [MgmtSrvr] WARNING  -- Failed to allocate nodeid for API at 211.115.91.243. Returned eror: 'No free node id found for mysqld(API).'
2012-06-11 11:34:53 [MgmtSrvr] WARNING  -- Failed to allocate nodeid for API at 211.115.91.243. Returned eror: 'No free node id found for mysqld(API).'
=======================================================================


=======================================================================
Now, when the restart fails.
=======================================================================
server.cluster.log
=======================================================================
2012-06-13 04:13:53 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1740 completed
2012-06-13 05:10:50 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1741 started. Keep GCI = 1864446 oldest restorable GCI = 1851904
2012-06-13 05:16:03 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1741 completed
2012-06-13 06:13:14 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1742 started. Keep GCI = 1868013 oldest restorable GCI = 1851904
2012-06-13 06:18:25 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1742 completed
2012-06-13 07:02:07 [MgmtSrvr] ALERT    -- Node 1: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-06-13 07:02:07 [MgmtSrvr] ALERT    -- Node 51: Node 1 Disconnected
2012-06-13 07:02:08 [MgmtSrvr] ALERT    -- Node 2: Node 1 Disconnected
2012-06-13 07:02:08 [MgmtSrvr] INFO    -- Node 2: Communication to Node 1 closed
2012-06-13 07:02:08 [MgmtSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
2012-06-13 07:02:08 [MgmtSrvr] INFO    -- Node 2: President restarts arbitration thread [state=7]
2012-06-13 07:02:08 [MgmtSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 51
2012-06-13 07:02:08 [MgmtSrvr] INFO    -- Node 2: Removed lock for node 1
2012-06-13 07:02:08 [MgmtSrvr] INFO    -- Node 2: DICT: remove lock by failed node 1 for NodeRestart
2012-06-13 07:02:08 [MgmtSrvr] INFO    -- Node 2: DICT: unlocked by node 1 for NodeRestart
2012-06-13 07:02:09 [MgmtSrvr] INFO    -- Node 2: Started arbitrator node 51 [ticket=3e7700020023f4ec]
2012-06-13 07:02:29 [MgmtSrvr] INFO    -- Node 2: Communication to Node 1 opened
2012-06-13 07:15:16 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1743 started. Keep GCI = 1871601 oldest restorable GCI = 1851904
2012-06-13 07:20:29 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1743 completed
2012-06-13 08:16:18 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1744 started. Keep GCI = 1875158 oldest restorable GCI = 1851904
2012-06-13 08:19:21 [MgmtSrvr] INFO    -- Nodeid 102 allocated for API at 211.115.91.243
2012-06-13 08:19:21 [MgmtSrvr] INFO    -- Node 102: mysqld --server-id=1
2012-06-13 08:19:21 [MgmtSrvr] INFO    -- Node 2: Node 102 Connected
2012-06-13 08:19:21 [MgmtSrvr] INFO    -- Node 2: Node 102: API mysql-5.5.20 ndb-7.2.5
2012-06-13 08:19:27 [MgmtSrvr] INFO    -- Node 2: Node 101 Connected
2012-06-13 08:19:27 [MgmtSrvr] INFO    -- Node 2: Node 101: API mysql-5.5.20 ndb-7.2.5
2012-06-13 08:21:32 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1744 completed
2012-06-13 09:10:22 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1745 started. Keep GCI = 1878674 oldest restorable GCI = 1851904
2012-06-13 09:15:44 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1745 completed
2012-06-13 10:03:22 [MgmtSrvr] INFO    -- Node 2: Local checkpoint 1746 started. Keep GCI = 1881817 oldest restorable GCI = 1881991
royster
Please post the total physical memory on the ndb nodes, the contents of config.ini, and the output of these two commands from the MGM node:
ndb_mgm> show
ndb_mgm> all report memory
사악쫑이
Each ndb node has 16 GB of physical memory.
===================================================
ndb_mgm> show
Connected to Management Server at: xxx.xxx.xxx.243:1186
Cluster Configuration
---------------------
[ndbd(NDB)]    2 node(s)
id=1    @xxx.xxx.xxx.240  (mysql-5.5.20 ndb-7.2.5, starting, Nodegroup: 0)
id=2    @xxx.xxx.xxx.241  (mysql-5.5.20 ndb-7.2.5, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 2 node(s)
id=51  @xxx.xxx.xxx.242  (mysql-5.5.20 ndb-7.2.5)
id=52  @xxx.xxx.xxx.243  (mysql-5.5.20 ndb-7.2.5)

[mysqld(API)]  2 node(s)
id=101  @xxx.xxx.xxx.242  (mysql-5.5.20 ndb-7.2.5)
id=102  @xxx.xxx.xxx.243  (mysql-5.5.20 ndb-7.2.5)

ndb_mgm> all report memory
Node 1: Data usage is 21%(69783 32K pages of total 327680)
Node 1: Index usage is 11%(46251 8K pages of total 393248)
Node 2: Data usage is 32%(107102 32K pages of total 327680)
Node 2: Index usage is 14%(55706 8K pages of total 393248)
사악쫑이
Here is config.ini.
====================================================
# TCP PARAMETERS
#
[tcp default]
SendBufferMemory=10M
ReceiveBufferMemory=10M
#
# Increasing the sizes of these 2 buffers beyond the default values
# helps prevent bottlenecks due to slow disk I/O.
#
# MANAGEMENT NODE PARAMETERS
#
[ndb_mgmd default]
DataDir=/var/mysql-cluster/ndb_data
#
# It is possible to use a different data directory for each management
# server, but for ease of administration it is preferable to be
# consistent.
#
[ndb_mgmd]
HostName=xxx.xxx.xxx.242
NodeId=51
# NodeId=management-server-A-nodeid
#
[ndb_mgmd]
NodeId=52
HostName=xxx.xxx.xxx.243
# NodeId=management-server-B-nodeid
#
# Using 2 management servers helps guarantee that there is always an
# arbitrator in the event of network partitioning, and so is
# recommended for high availability. Each management server must be
# identified by a HostName. You may for the sake of convenience specify
# a NodeId for any management server, although one will be allocated
# for it automatically; if you do so, it must be in the range 1-255
# inclusive and must be unique among all IDs specified for cluster
# nodes.
#
# DATA NODE PARAMETERS
#
[ndbd default]
NoOfReplicas=2
FileSystemPathDD=/home/mysql/data
FileSystemPathDataFiles=/var/mysql-cluster/ndb_data/dndata
FileSystemPathUndoFiles=/var/mysql-cluster/ndb_data/dnlogs
DataDir=/var/mysql-cluster/ndb_data
InitialLogFileGroup = name=LG1; undo_buffer_size=64M; undo1.log:1G; undo2.log:1G; undo3.log:1G; undo4.log:1G;
InitialTablespace = name=TS1; extent_size=8M; data1.dat:10G; data2.dat:10G; data3.dat:10G; data4.dat:10G; data5.dat:10G; data6.dat:10G; data7.dat:10G; data8.dat:10G; data9.dat:10G; data10.dat:10G; data11.dat:10G; data12.dat:10G; data13.dat:10G; data14.dat:10G; data15.dat:10G; data16.dat:10G; data17.dat:10G; data18.dat:10G; data19.dat:10G; data20.dat:10G;
#
# Using 2 replicas is recommended to guarantee availability of data;
# using only 1 replica does not provide any redundancy, which means
# that the failure of a single data node causes the entire cluster to
# shut down. We do not recommend using more than 2 replicas, since 2 is
# sufficient to provide high availability, and we do not currently test
# with greater values for this parameter.
#
#LockPagesInMainMemory=0
#
# On Linux and Solaris systems, setting this parameter locks data node
# processes into memory. Doing so prevents them from swapping to disk,
# which can severely degrade cluster performance.
#
DataMemory=10240M
IndexMemory=3072M
StringMemory=10
# Transaction Parameters #
MaxNoOfConcurrentTransactions=8588
MaxNoOfConcurrentOperations=1000000
MaxNoOfLocalOperations=1100000
# Transaction Temporary Storage #
MaxNoOfConcurrentIndexOperations=4096000
MaxNoOfFiredTriggers=1000
TransactionBufferMemory=3M
# Scans and buffering #
MaxNoOfConcurrentScans=256
MaxNoOfLocalScans=512
#BatchSizePerLocalScan=992
LongMessageBuffer=1M
TimeBetweenLocalCheckpoints=20
TransactionInactiveTimeout=0
TransactionDeadLockDetectionTimeOut=10000
#
# The values provided for DataMemory and IndexMemory assume 4 GB RAM
# per data node. However, for best results, you should first calculate
# the memory that would be used based on the data you actually plan to
# store (you may find the ndb_size.pl utility helpful in estimating
# this), then allow an extra 20% over the calculated values. Naturally,
# you should ensure that each data node host has at least as much
# physical memory as the sum of these two values.
#
# ODirect=1
#
# Enabling this parameter causes NDBCLUSTER to try using O_DIRECT
# writes for local checkpoints and redo logs; this can reduce load on
# CPUs. We recommend doing so when using MySQL Cluster NDB 6.2.3 or
# newer on systems running Linux kernel 2.6 or later.
#
#NoOfFragmentLogFiles=300
#DataDir=/var/mysql-cluster/ndb_data
#MaxNoOfConcurrentOperations=100000
NoOfFragmentLogFiles=300
FragmentLogFileSize=512M
MaxNoOfOpenFiles=255
InitialNoOfOpenFiles=27
MaxNoOfSavedMessages=25
#
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
RealTimeScheduler=1
# Setting these parameters allows you to take advantage of real-time scheduling
# of NDBCLUSTER threads (introduced in MySQL Cluster NDB 6.3.4) to get higher
# throughput.
#
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
#
# CompressedLCP=1
# CompressedBackup=1
# Enabling CompressedLCP and CompressedBackup causes, respectively, local
# checkpoint files and backup files to be compressed, which can result in a space
# savings of up to 50% over noncompressed LCPs and backups.
#
# MaxNoOfLocalScans=64
#MaxNoOfTables=1024
#MaxNoOfOrderedIndexes=256
# Metadata Objects #
MaxNoOfAttributes=1500
MaxNoOfTables=200
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfTriggers=100
#
[ndbd]
HostName=xxx.xxx.xxx.240
NodeId=1
# NodeId=data-node-A-nodeid
#
# On systems with multiple CPUs, these parameters can be used to lock NDBCLUSTER
# threads to specific CPUs
#
[ndbd]
HostName=xxx.xxx.xxx.241
NodeId=2
# NodeId=data-node-B-nodeid
#
#
# You must have an [ndbd] section for every data node in the cluster;
# each of these sections must include a HostName. Each section may
# optionally include a NodeId for convenience, but in most cases, it is
# sufficient to allow the cluster to allocate node IDs dynamically. If
# you do specify the node ID for a data node, it must be in the range 1
# to 48 inclusive and must be unique among all IDs specified for
# cluster nodes.
#
# SQL NODE / API NODE PARAMETERS
#
[mysqld]
HostName=xxx.xxx.xxx.242
NodeId=101
# HostName=sql-node-A-hostname
# NodeId=sql-node-A-nodeid
#
[mysqld]
HostName=xxx.xxx.xxx.243
NodeId=102
#
#
#
# Each API or SQL node that connects to the cluster requires a [mysqld]
# or [api] section of its own. Each such section defines a connection
# “slot”; you should have at least as many of these sections in the
# config.ini file as the total number of API nodes and SQL nodes that
# you wish to have connected to the cluster at any given time. There is
# no performance or other penalty for having extra slots available in
# case you find later that you want or need more API or SQL nodes to
# connect to the cluster at the same time.
# If no HostName is specified for a given [mysqld] or [api] section,
# then any API or SQL node may use that slot to connect to the
# cluster. You may wish to use an explicit HostName for one connection slot
# to guarantee that an API or SQL node from that host can always
# connect to the cluster. If you wish to prevent API or SQL nodes from
# connecting from other than a desired host or hosts, then use a
# HostName for every [mysqld] or [api] section in the config.ini file.
# You can if you wish define a node ID (NodeId parameter) for any API or
# SQL node, but this is not necessary; if you do so, it must be in the
# range 1 to 255 inclusive and must be unique among all IDs specified
# for cluster nodes.
royster
It looks like the NDBD processes are being swapped out of memory!
Reduce IndexMemory from 3072M to about 2000M,
and DataMemory from 10240M to about 8000M.
Also change MaxNoOfConcurrentTransactions to 20000 and MaxNoOfConcurrentScans to around 350-400, apply the changes, and then bring the ndbd nodes back up.
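Expressed as a config.ini fragment, royster's suggested changes to the [ndbd default] section would look roughly like this (suggested values from this thread only, not verified against this workload):

```ini
[ndbd default]
# Leave headroom on the 16 GB hosts so ndbd is not swapped out
DataMemory=8000M
IndexMemory=2000M
# Raise scan/transaction limits; error 489 'Too many active scans'
# appeared repeatedly in the mysqld error log above
MaxNoOfConcurrentTransactions=20000
MaxNoOfConcurrentScans=400
```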
사악쫑이
Ah, I hadn't considered memory swapping.
I'll apply these changes.
I've learned a lot here. Thank you ^^
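One way to confirm whether the data nodes are actually swapping, on any Linux host, is to compare SwapTotal and SwapFree and to check the per-process VmSwap of ndbd (standard /proc interfaces; nothing cluster-specific is assumed):

```shell
# Host-wide swap figures in kB; usage = SwapTotal - SwapFree
grep -E '^Swap(Total|Free)' /proc/meminfo

# Per-process swap for ndbd, if it is running; VmSwap stays at 0 kB
# while the process is fully resident in RAM
if pid=$(pgrep -x ndbd | head -n1) && [ -n "$pid" ]; then
    grep VmSwap "/proc/$pid/status"
fi
```

The posted config.ini also has LockPagesInMainMemory commented out; enabling it is the usual way to keep ndbd resident once DataMemory/IndexMemory are sized to fit in physical RAM.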
민족
Did this get resolved? Looking at the error log, it's a timeout problem~
민족
Please also check how the cluster is laid out: is each engine on its own server, or are the engines running on two servers?